Publications
2023
Modeling the Trade-off of Privacy Preservation and Activity Recognition on Low-Resolution Images
Abstract
A computer vision system using low-resolution image sensors can provide intelligent services (e.g., activity recognition) while preserving visual privacy at the hardware level. However, preserving visual privacy and enabling accurate machine recognition place conflicting demands on image resolution. Modeling the trade-off between privacy preservation and machine recognition performance can guide future privacy-preserving computer vision systems that use low-resolution image sensors. In this paper, using at-home activities of daily living (ADLs) as the scenario, we first obtained the most important visual privacy features through a user survey. Then we quantified and analyzed the effects of image resolution on human and machine recognition performance in activity recognition and privacy awareness tasks. We also investigated how modern image super-resolution techniques influence these effects. Based on the results, we proposed a method for modeling the trade-off of privacy preservation and activity recognition on low-resolution images.
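To make the modeling idea concrete, here is a minimal sketch of tracing an accuracy-versus-privacy curve across sensor resolutions. `recognize_activity` and `rate_privacy` are hypothetical stand-ins for the recognition model and the privacy-awareness scoring described in the paper, and the resolution list is illustrative only.

```python
# Sketch: trace an accuracy-vs-privacy curve across image resolutions.
# `recognize_activity` and `rate_privacy` are hypothetical placeholders for
# the activity-recognition model and privacy-awareness scoring in the paper.
from PIL import Image

RESOLUTIONS = [8, 16, 32, 64, 128]  # square output sizes to test

def downsample(img: Image.Image, size: int) -> Image.Image:
    """Simulate a low-resolution sensor by area-averaging to size x size."""
    return img.resize((size, size), Image.BOX)

def tradeoff_curve(images, labels, recognize_activity, rate_privacy):
    curve = []
    for size in RESOLUTIONS:
        lowres = [downsample(img, size) for img in images]
        acc = sum(recognize_activity(x) == y for x, y in zip(lowres, labels)) / len(labels)
        privacy = sum(rate_privacy(x) for x in lowres) / len(lowres)  # higher = more preserved
        curve.append((size, acc, privacy))
    return curve
```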
EarCough: Enabling Continuous Subject Cough Event Detection on Hearables
Abstract
Cough monitoring can enable new individual pulmonary health applications. Subject cough event detection is the foundation for continuous cough monitoring. Recently, the rapid growth in smart hearables has opened new opportunities for such needs. This paper proposes EarCough, which enables continuous subject cough event detection on edge computing hearables by leveraging the always-on active noise cancellation (ANC) microphones. Specifically, we proposed a lightweight end-to-end neural network model, EarCoughNet. To evaluate the effectiveness of our method, we constructed a synchronous motion and audio dataset through a user study. Results show that EarCough achieved an accuracy of 95.4% and an F1-score of 92.9% with a space requirement of only 385 kB. We envision EarCough as a low-cost add-on for future hearables to enable continuous subject cough event detection.
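The abstract does not detail EarCoughNet's architecture; the sketch below only illustrates how a lightweight 1D convolutional cough-event detector could be sized and checked against the reported 385 kB footprint. All layer choices are assumptions.

```python
# Sketch of a lightweight 1D-CNN cough-event detector, assuming log-mel input
# frames; EarCoughNet's actual architecture is not specified in the abstract.
import torch
import torch.nn as nn

class TinyCoughNet(nn.Module):
    def __init__(self, n_mels: int = 40, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, n_mels, time)
        return self.head(self.net(x).squeeze(-1))

model = TinyCoughNet()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} params = {n_params * 4 / 1024:.0f} kB in float32")  # well under 385 kB
```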
Enabling Voice-Accompanying Hand-to-Face Gesture Recognition with Cross-Device Sensing
Abstract
Gestures performed accompanying the voice are essential for voice interaction to convey complementary semantics for interaction purposes such as wake-up state and input modality. In this paper, we investigated voice-accompanying hand-to-face (VAHF) gestures for voice interaction. We targeted hand-to-face gestures because such gestures relate closely to speech and yield significant acoustic features (e.g., impeding voice propagation). We conducted a user study to explore the design space of VAHF gestures, where we first gathered candidate gestures and then applied a structural analysis to them in different dimensions (e.g., contact position and type), outputting a total of 8 VAHF gestures with good usability and the least confusion. To facilitate VAHF gesture recognition, we proposed a novel cross-device sensing method that leverages heterogeneous channels (vocal, ultrasound, and IMU) of data from commodity devices (earbuds, watches, and rings). Our recognition model achieved an accuracy of 97.3% for recognizing 3 gestures and 91.5% for recognizing 8 gestures (excluding the "empty" gesture), proving its high applicability. Quantitative analysis also shed light on the recognition capability of each sensor channel and their different combinations. In the end, we illustrated feasible use cases and their design principles to demonstrate the applicability of our system in various scenarios.
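As an illustration of cross-channel fusion (not the paper's actual recognition model), the sketch below concatenates per-sample feature vectors from the vocal, ultrasound, and IMU channels and trains a generic classifier; the feature extraction step and the data are placeholders.

```python
# Sketch: late-fusion classification over heterogeneous sensor channels.
# Per-channel feature extraction is assumed to have been done already;
# the paper's actual recognition model is not described at this level.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse(vocal_feats, ultrasound_feats, imu_feats):
    """Concatenate per-sample feature vectors from the three channels."""
    return np.concatenate([vocal_feats, ultrasound_feats, imu_feats], axis=1)

# Toy data: 100 samples, arbitrary feature sizes per channel.
rng = np.random.default_rng(0)
X = fuse(rng.normal(size=(100, 64)), rng.normal(size=(100, 32)), rng.normal(size=(100, 24)))
y = rng.integers(0, 8, size=100)  # 8 VAHF gesture classes

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))
```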
GazeReader: Detecting Unknown Word Using Webcam for English as a Second Language (ESL) Learners
Abstract
Automatic unknown word detection techniques can enable new applications for assisting English as a Second Language (ESL) learners, thus improving their reading experiences. However, most modern unknown word detection methods require dedicated eye-tracking devices with high precision that are not easily accessible to end-users. In this work, we propose GazeReader, an unknown word detection method that uses only a webcam. GazeReader tracks the learner's gaze and then applies a transformer-based machine learning model that encodes the text information to locate the unknown word. We applied knowledge enhancement including term frequency, part of speech, and named entity recognition to improve the performance. The user study indicates that the accuracy and F1-score of our method were 98.09% and 75.73%, respectively. Lastly, we explored the design scope for ESL reading and discussed the findings.
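The paper uses a transformer-based model; purely to illustrate how the knowledge-enhancement features named above (term frequency, part of speech, named entities) can be combined with gaze dwell time, the sketch below scores words with a simple logistic regression on hypothetical feature vectors.

```python
# Sketch: combining gaze-derived fixation features with the knowledge features
# named in the abstract (term frequency, part of speech, named entities) to
# score words as unknown/known. The paper uses a transformer-based model;
# a logistic regression stands in here purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

def word_features(fixation_ms, term_freq, is_rare_pos, is_named_entity):
    """One feature vector per word on the page."""
    return np.array([fixation_ms, np.log1p(term_freq), is_rare_pos, is_named_entity])

# Toy training data: (total fixation time, corpus frequency, POS flag, NER flag).
X = np.stack([
    word_features(850, 3, 1, 0),       # long dwell, rare word   -> likely unknown
    word_features(120, 90_000, 0, 0),  # short dwell, common word -> known
    word_features(640, 12, 1, 1),
    word_features(200, 40_000, 0, 0),
])
y = np.array([1, 0, 1, 0])  # 1 = unknown word

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([word_features(700, 8, 1, 0)])[0, 1])
```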
Selecting Real-World Objects via User-Perspective Phone Occlusion
Abstract
Perceiving the region of interest (ROI) and target object by smartphones from the user’s first-person perspective can enable diverse spatial interactions. In this paper, we propose a novel ROI input method and a target selecting method for smartphones by utilizing user-perspective phone occlusion. This concept of turning the phone into a real-world physical cursor benefits from proprioception, gets rid of the constraint of the camera preview, and allows users to rapidly and accurately select the target object. Meanwhile, our method can provide a resizable and rotatable rectangular ROI to disambiguate dense targets. We implemented the prototype system by positioning the user’s iris with the front camera and simultaneously estimating the rectangular area blocked by the phone with the rear camera, followed by a target prediction algorithm with a distance-weighted Jaccard index. We analyzed the behavioral models of using our method and evaluated our prototype system’s pointing accuracy and usability. Results showed that our method is well-accepted by the users for its convenience, accuracy, and efficiency.
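The abstract names a distance-weighted Jaccard index for target prediction but not its exact weighting; the sketch below assumes a Gaussian falloff on center distance as one plausible instantiation.

```python
# Sketch: scoring candidate objects against the phone-occluded ROI with a
# distance-weighted Jaccard (IoU) index. The exact weighting in the paper is
# not specified here; a Gaussian falloff on center distance is assumed.
import math

def jaccard(a, b):
    """a, b: axis-aligned boxes (x1, y1, x2, y2). Returns IoU in [0, 1]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def center(r):
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)

def distance_weighted_jaccard(roi, obj, sigma=100.0):
    """Overlap score discounted by how far the object center is from the ROI center."""
    (cx, cy), (ox, oy) = center(roi), center(obj)
    dist = math.hypot(cx - ox, cy - oy)
    return jaccard(roi, obj) * math.exp(-(dist ** 2) / (2 * sigma ** 2))

def predict_target(roi, candidates):
    """Pick the candidate object box with the highest weighted score."""
    return max(candidates, key=lambda obj: distance_weighted_jaccard(roi, obj))
```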
From 2D to 3D: Facilitating Single-Finger Mid-Air Typing on QWERTY Keyboards with Probabilistic Touch Modeling
Abstract
Mid-air text entry on virtual keyboards suffers from the lack of tactile feedback, which brings challenges to both tap detection and input prediction. In this paper, we explored the feasibility of single-finger typing on virtual QWERTY keyboards in mid-air. We first conducted a study to examine users’ 3D typing behavior on different sizes of virtual keyboards. Results showed that the participants perceived the vertical projection of the lowest point on the keyboard during a tap as the target location, and that inferring taps based on the intersection between the finger and the keyboard was not applicable. Aiming at this challenge, we derived a novel input prediction algorithm that incorporated the uncertainty in tap detection into the calculation as a probability and performed probabilistic decoding that could tolerate false detection. We analyzed the performance of the algorithm through a full-factorial simulation. Results showed that the SVM-based probabilistic touch detection together with a 2D elastic probabilistic decoding algorithm (elasticity = 2) could achieve the optimal top-5 accuracy of 94.2%. In the evaluation user study, the participants reached a single-finger typing speed of 26.1 WPM with a 3.2% uncorrected word-level error rate, which was significantly better than both tap-based and gesture-based baseline techniques. Also, the proposed technique received the highest preference score from the users, proving its usability in real text entry tasks.
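As a minimal sketch of probabilistic decoding that keeps tap uncertainty as a probability, the snippet below combines the detector's tap probability, a Gaussian key likelihood, and a unigram word prior. This is a common formulation, not the paper's exact decoder, and the toy keyboard layout and lexicon are placeholders.

```python
# Sketch: probabilistic word decoding that keeps tap-detection uncertainty as a
# probability rather than a hard decision. The Gaussian touch model and unigram
# prior below are common simplifications, not the paper's exact decoder.
import math

KEY_CENTERS = {"q": (0, 0), "w": (1, 0), "e": (2, 0), "r": (3, 0), "t": (4, 0)}  # toy layout

def key_likelihood(tap_xy, key, sigma=0.5):
    kx, ky = KEY_CENTERS[key]
    d2 = (tap_xy[0] - kx) ** 2 + (tap_xy[1] - ky) ** 2
    return math.exp(-d2 / (2 * sigma ** 2))

def word_score(word, taps, lexicon_prob):
    """taps: list of (x, y, p_tap) with p_tap = detector's tap probability."""
    if len(word) != len(taps):
        return 0.0
    score = lexicon_prob.get(word, 1e-9)
    for ch, (x, y, p_tap) in zip(word, taps):
        score *= p_tap * key_likelihood((x, y), ch)
    return score

def decode(taps, lexicon_prob, top_k=5):
    ranked = sorted(lexicon_prob, key=lambda w: word_score(w, taps, lexicon_prob), reverse=True)
    return ranked[:top_k]

lexicon = {"wet": 0.4, "were": 0.6}
print(decode([(1.1, 0.2, 0.9), (2.2, -0.1, 0.8), (3.9, 0.1, 0.7)], lexicon))
```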
ConeSpeech: Exploring Directional Speech Interaction for Multi-Person Remote Communication in Virtual Reality
Abstract
Remote communication is essential for efficient collaboration among people at different locations. We present ConeSpeech, a virtual reality (VR) based multi-user remote communication technique which enables users to selectively speak to target listeners without distracting bystanders. With ConeSpeech, the user looks at the target listener, and only listeners within a cone-shaped area in that direction can hear the speech. This manner alleviates the disturbance to, and avoids overhearing from, surrounding irrelevant people. Three featured functions are supported: directional speech delivery, a size-adjustable delivery range, and multiple delivery areas, to facilitate speaking to more than one listener and to listeners spatially mixed with bystanders. We conducted a user study to determine the modality to control the cone-shaped delivery area. Then we implemented the technique and evaluated its performance in three typical multi-user communication tasks by comparing it to two baseline methods. Results show that ConeSpeech balanced the convenience and flexibility of voice communication.
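A small geometric sketch of the core cone test follows, assuming the delivery area is defined by the speaker's head position, gaze direction, and an adjustable half-angle; the parameter names and default angle are illustrative, not taken from the paper.

```python
# Sketch: test whether a listener falls inside the speaker's cone-shaped
# delivery area, given head position and gaze direction. Parameter names and
# the default half-angle are illustrative; the paper's geometry may differ.
import numpy as np

def in_delivery_cone(speaker_pos, gaze_dir, listener_pos, half_angle_deg=30.0):
    """True if the listener lies within the cone opened around the gaze direction."""
    to_listener = np.asarray(listener_pos, float) - np.asarray(speaker_pos, float)
    dist = np.linalg.norm(to_listener)
    if dist == 0:
        return True
    gaze = np.asarray(gaze_dir, float)
    cos_angle = float(np.dot(to_listener / dist, gaze / np.linalg.norm(gaze)))
    return cos_angle >= np.cos(np.radians(half_angle_deg))

print(in_delivery_cone((0, 0, 0), (0, 0, -1), (0.3, 0, -2)))  # slightly off-axis -> True
print(in_delivery_cone((0, 0, 0), (0, 0, -1), (2, 0, -1)))    # far off-axis -> False
```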
Enabling Real-Time On-Chip Audio Super Resolution for Bone-Conduction Microphones
Abstract
Voice communication using an air-conduction microphone in noisy environments suffers from the degradation of speech audibility. Bone-conduction microphones (BCM) are robust against ambient noises but suffer from limited effective bandwidth due to their sensing mechanism. Although existing audio super-resolution algorithms can recover the high-frequency loss to achieve high-fidelity audio, they require considerably more computational resources than are available in low-power hearable devices. This paper proposes the first-ever real-time on-chip speech audio super-resolution system for BCM. To accomplish this, we built and compared a series of lightweight audio super-resolution deep-learning models. Among all these models, ATS-UNet was the most cost-efficient because the proposed novel Audio Temporal Shift Module (ATSM) reduces the network's dimensionality while maintaining sufficient temporal features from speech audio. Then, we quantized and deployed the ATS-UNet to low-end ARM micro-controller units for a real-time embedded prototype. The evaluation results show that our system achieved real-time inference speed on Cortex-M7 and higher audio quality compared with the baseline audio super-resolution method. Finally, we conducted a user study with ten experts and ten amateur listeners to evaluate our method's effectiveness to human ears. Both groups perceived significantly higher speech quality with our method when compared to the solutions with the original BCM or an air-conduction microphone with cutting-edge noise-reduction algorithms.
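ATSM's exact design is not given in the abstract; the sketch below follows the generic temporal-shift idea of moving a fraction of feature channels one frame forward or backward in time at zero parameter cost, which is assumed here as an approximation of how such a module keeps temporal context cheap.

```python
# Sketch of a temporal shift over audio feature maps: a fraction of channels is
# shifted one frame forward/backward in time at zero extra parameter cost. This
# follows the generic temporal-shift idea; the exact ATSM design may differ.
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """x: (batch, channels, time). Returns a tensor of the same shape."""
    b, c, t = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :fold, 1:] = x[:, :fold, :-1]                   # first fold: delayed by one frame
    out[:, fold:2 * fold, :-1] = x[:, fold:2 * fold, 1:]   # next fold: advanced by one frame
    out[:, 2 * fold:] = x[:, 2 * fold:]                    # remaining channels untouched
    return out

x = torch.randn(1, 16, 10)
print(temporal_shift(x).shape)  # torch.Size([1, 16, 10])
```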
HandAvatar: Embodying Non-Humanoid Virtual Avatars through Hands
Abstract
We propose HandAvatar to enable users to embody non-humanoid avatars using their hands. HandAvatar leverages the high dexterity and coordination of users’ hands to control virtual avatars, enabled through our novel approach for automatically-generated joint-to-joint mappings. We contribute an observation study to understand users’ preferences for hand-to-avatar mappings on eight avatars. Leveraging insights from the study, we present an automated approach that generates mappings between users’ hands and arbitrary virtual avatars by jointly optimizing control precision, structural similarity, and comfort. We evaluated HandAvatar on static posing, dynamic animation, and creative exploration tasks. Results indicate that HandAvatar enables more precise control, requires less physical effort, and brings comparable embodiment compared to a state-of-the-art body-to-avatar control method. We demonstrate HandAvatar’s potential with applications including non-humanoid avatar-based social interaction in VR, 3D animation composition, and VR scene design with physical proxies. We believe that HandAvatar unlocks new interaction opportunities, especially in Virtual Reality, by letting users become the avatar in applications including virtual social interaction, animation, gaming, and education.
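To illustrate one way a joint objective over control precision, structural similarity, and comfort could yield a joint-to-joint mapping (an assumption for illustration, not HandAvatar's published optimizer), the sketch below combines three cost matrices and solves the resulting assignment with the Hungarian algorithm.

```python
# Sketch: choosing a one-to-one hand-joint-to-avatar-joint mapping by minimizing
# a weighted sum of cost terms. The three terms mirror the criteria named in the
# abstract; the actual optimization in HandAvatar may differ.
import numpy as np
from scipy.optimize import linear_sum_assignment

def build_cost(precision_cost, structure_cost, comfort_cost, w=(1.0, 1.0, 0.5)):
    """Each *_cost is an (n_hand_joints, n_avatar_joints) matrix; lower is better."""
    return w[0] * precision_cost + w[1] * structure_cost + w[2] * comfort_cost

rng = np.random.default_rng(1)
n_hand, n_avatar = 15, 15  # toy joint counts
cost = build_cost(rng.random((n_hand, n_avatar)),
                  rng.random((n_hand, n_avatar)),
                  rng.random((n_hand, n_avatar)))

hand_idx, avatar_idx = linear_sum_assignment(cost)  # optimal joint-to-joint mapping
print(list(zip(hand_idx.tolist(), avatar_idx.tolist())))
```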