SIMULATING NATURAL APPRECIATION OF PHYSICAL SPACE
Thoughts in progress on AI and Multi-sensory Spatial Experience
by
Bbumba Emmanuel Ezekiel and Mukabire Joel
24th February 2025
ABSTRACT
This paper addresses the challenge of making a machine, that is, an artificial intelligence, more spatially useful by mimicking the human mind’s ability to focus on specific pieces of sensory information (optical, auditory, olfactory, haptic, gustatory, etc.) within a focus locus. In humans, this focus locus is always spatially flawed; its perfect incompleteness is what facilitates useful spatial awareness. We propose a series of experiments that test whether various AI models can exploit this spatial incompleteness when provided with deliberately incomplete data. These experiments examine: (a) whether large language models (LLMs) can fill in missing multisensory aspects of detailed spatial descriptions; (b) whether image-to-image generation models can coherently predict omitted visual and contextual information in spatially imperfect images; and (c) whether simulated 3D environment models can reconstruct the multisensory perception at each vertex when some of the sensory information is erased. Models such as OpenAI’s GPT-4o, o1, and o3-mini, Anthropic’s Claude 3.5, xAI’s Grok, and DeepSeek R1 would serve as suitable test cases owing to their advanced general abilities. This paper details the experimental design, methodology, and underlying rationale for achieving an AI that better harmonises with human spatial experience, minimising intrusion into natural behaviour while maximising the machine’s capacity to assist its user.
INTRODUCTION
Making a machine truly perceptually conscious may be an achievable feat in the far future, but it would be delusional to think that the major practical goal being pursued, namely the embodiment of genuine empathy for the human condition in the machine, can only be achieved that way.
We can start by focusing our attention on the physical space in which natural consciousness exists. So far, the intelligent machines we are building are confined to the virtual realm; we interact with them through digital virtual space. If we take the premise that machines are meant to help us in some way, no matter the application, then it is arguably important to narrow the gap between the virtual and the physical. One way to do this is to transport the human senses to the virtual realm, as in the endeavours of virtual reality and its assorted sensory-mediation gear. The other is to bring the machine into the physical realm, as exemplified by the efforts of robotics and augmented reality. The latter is arguably friendlier to our lifestyle: given that we are trying to get machines to help us, it is wiser to do less tweaking of human behaviour than of machine behaviour.
The uncanny valley between the real and the mimicking virtual, the burden on our collective conscience of an impending dystopian disaster of gear and intrusive implanted chips, and even the prospect of a possible upload someday are all ominous tales that colour our regard of machine use. It is safe to say that this essay is in favour of keeping the human condition as intact as possible, with very little modification to behaviour and lifestyle. The cumulative experiences of all humanity, and the limited speed at which we can absorb new stimuli and incorporate them into the cumulative experience we normally call culture, stand in the way of radical adaptation to work better with the machine. We should therefore do less tweaking of the human and more of the machine.
We need to enable the machine to experience physical space fully, the way we do, and rely less on a simulation of physical space as virtual space. The underlying inspiration of this paper is just that: to discuss and impress upon the reader how much is possible in enabling machines to harmoniously inhabit (or better, experience) our physical world for the purpose of assisting us in various ways while causing as little intrusion into our natural behaviour as possible.
Let us now define the human experience of space, as it provides the ingredients for the corresponding translation to the machine in the thread of thought supporting this paper. The human mind (itself a natural virtual entity) absorbs information through its sensory organs, which capture the experiential spheres, from the largest to the smallest: optical, auditory, olfactory, haptic, and gustatory. Two points of value emerge here: the hierarchy of the experiential spheres (focus loci for each sense), which helps the perceiver avoid an overload of irrelevant information; and focus, the accumulation of select content into a core that represents the most valuable information. This core remains the central focus as it is modified with each memory-retrieval cycle. As such, the experiments proposed in this paper deliberately embrace a value-based experience of space as the best way to teach the machine to experience our world.
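To make this model concrete, the following minimal sketch encodes the hierarchy of experiential spheres and the value-based focus core described above. It is written in Python; all names (FocusLocus, attend, recall) and the value-weighting scheme are our own illustrative choices, not an established API, and reflect only one plausible reading of the model:

from dataclasses import dataclass, field

# Experiential spheres ordered from the largest to the smallest, as above.
SPHERES = ["optical", "auditory", "olfactory", "haptic", "gustatory"]

@dataclass
class FocusLocus:
    """A deliberately incomplete snapshot of multisensory space: each
    sense keeps at most one attended cue; everything else is dropped,
    mirroring the 'perfect incompleteness' of human focus."""
    core: dict = field(default_factory=dict)  # sense -> (cue, value)

    def attend(self, sense: str, cue: str, value: float) -> None:
        """Keep a cue only if it outvalues what the core already holds."""
        if sense not in SPHERES:
            raise ValueError(f"unknown sense: {sense}")
        current = self.core.get(sense)
        if current is None or value > current[1]:
            self.core[sense] = (cue, value)

    def recall(self) -> dict:
        """One memory-retrieval cycle: return the current focus core."""
        return {sense: cue for sense, (cue, _) in self.core.items()}

locus = FocusLocus()
locus.attend("optical", "sunlit courtyard", 0.9)
locus.attend("auditory", "murmuring fountain", 0.7)
locus.attend("auditory", "distant traffic", 0.2)   # discarded: lower value
print(locus.recall())   # haptic, olfactory, gustatory stay blank by design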
LITERATURE REVIEW
Current generative AI is hyper-creative and hyper-productive. However, despite its remarkable capabilities, AI remains limited by the data upon which it is trained. For instance, a model might understand a native low-resource language from merely partial exposure and yet not be able to speak it at all. This raises broader questions about AI synthesis and the notions of imitation versus true understanding, as discussed by Mario Carpo in his lecture “Imitation, Style and the Eternal Return of the Precedent.” Carpo argues that a re-examination of imitation would expose the raw nature of AI’s complex copying operations.
Furthermore, while vision models and multimodal models such as GPT-4o and Gemini 2.0 can recognise and summarise the contents of an image robustly, they struggle to grasp spatial relationships and depth perception, the dynamic interactions between objects in space. Although these models can describe an image’s objects accurately, the inter-object positions and the inherent spatial depth are often misinterpreted. This is particularly problematic when spaces are experienced in a multi-sensory context, where qualitative aspects such as mood, tactile cues, and other implicit sensory cues contribute to our understanding of a space.
Some notable variant threads of thought that still link to the experiments are the following:
• Fractal systems have been proposed as a model for understanding how recursion in structures might mimic layered spatial experiences, much as a sculptor’s relief might encode layered meanings.
• An emerging perspective is the need for a “spatial experience model” (SEM) of AI, one that internalises not only visual data but also multisensory and psycho-emotional cues.
• Existing attempts at multi-modality, as in models like GPT-4o, remain underutilised, partly due to rigid compartmentalisation of sense-specific data. A holistic AI, more mindful of the focus of sensory cues, could bridge the gap between virtual encounters and true physical presence, ensuring that the machine assists rather than overwhelms its user.
PROPOSED METHODOLOGIES
The overarching aim is to test various AI models for their ability to overcome spatial incompleteness—the deliberate omission of specific sensory details—by simulating the human mind’s focus locus. The following experiments are proposed:
A. EXPERIMENT 1: TESTING LLMs WITH DELIBERATELY INCOMPLETE SPATIAL DESCRIPTIONS
1. Method Rationale
In human spatial experience, focus is achieved by selectively neglecting and then intuitively filling in missing sensory aspects. Here, large language models (LLMs) will be provided with detailed descriptions of spaces from which certain key sensory details (e.g., tactile or olfactory information) corresponding to the spatial focus locus have been purposefully omitted.
2. Experiment Design
• The prompt will include a comprehensive qualitative description of a space in terms of multi-sensory perception, yet will omit one or two critical elements (e.g., the tactile textures of a stone surface or the olfactory hints of a rain-soaked environment).
• Example: “Describe a sunlit courtyard with verdant greenery, softly murmuring fountains, and the faint scent of earth—but omit explicit mention of the tactile sensations one might feel underfoot.”
• The LLMs (e.g., GPT-4o, o1, o3-mini, Claude 3.5, Grok, DeepSeek R1) will then be evaluated on how accurately and coherently they predict the omitted sensory aspects, as in the sketch below.
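A minimal harness for issuing such prompts programmatically might look as follows. This is a sketch assuming OpenAI’s Python client; the withheld-channel bookkeeping, the example description, and the prompt wording are our own illustrative scaffolding, and other providers’ clients would substitute for the non-OpenAI models:

# Experiment 1 sketch: one sensory channel is withheld from a spatial
# description and the model is asked to predict the missing channel.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DESCRIPTION = (
    "A sunlit courtyard with verdant greenery, softly murmuring "
    "fountains, and the faint scent of damp earth."
)
WITHHELD_CHANNEL = "haptic"  # tactile detail is deliberately omitted

prompt = (
    f"The following description of a space omits all {WITHHELD_CHANNEL} "
    f"(touch-related) detail:\n\n{DESCRIPTION}\n\n"
    f"Predict, in two or three sentences, the {WITHHELD_CHANNEL} "
    "sensations a visitor would most plausibly experience."
)

response = client.chat.completions.create(
    model="gpt-4o",  # swap in other test models here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)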
3. Evaluation Metrics
• Congruence of filled sensory details with likely human experiences from various backgrounds.
• The balance between adding detail and preserving the intentional blank canvas for user interpretation.
• Qualitative user studies comparing the model outputs with human descriptions of the same space.
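One way to operationalise the congruence metric is to embed the model’s filled-in detail and a pool of human-written references and compare them by cosine similarity. A sketch, assuming the sentence-transformers library and its publicly available all-MiniLM-L6-v2 checkpoint; the example texts are invented for illustration:

from sentence_transformers import SentenceTransformer, util

# Score a model's filled-in sensory detail against human reference
# descriptions; higher cosine similarity = higher congruence.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

model_output = "Cool, uneven flagstones press gently against bare feet."
human_references = [
    "The stone floor feels cold and slightly rough underfoot.",
    "Smooth, warm paving that still holds the afternoon sun.",
]

emb_out = encoder.encode(model_output, convert_to_tensor=True)
emb_refs = encoder.encode(human_references, convert_to_tensor=True)
scores = util.cos_sim(emb_out, emb_refs)  # shape: 1 x num_references
print(f"best congruence score: {scores.max().item():.3f}")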
B. EXPERIMENT 2: IMAGE-TO-IMAGE GENERATION FROM SPATIALLY INCOMPLETE IMAGES
1. Method Rationale
Just as LLMs can be tested with incomplete text descriptions, image-to-image generation models can be tested with images that have been deliberately stripped of some contextual or sensory details. The hypothesis is that by providing hints regarding the complete qualitative, multi-sensory perception of a scene, the models will be challenged to recreate the omitted aspects.
2. Experiment Design
• Models will be fed a satellite image with certain segments erased, or a landscape image with missing context (e.g., an erased section that would normally show dense foliage or water reflections).
• The image-to-image prompt will include detailed textual hints that describe what is missing—for example, “In the missing section, imagine a gently rippling water surface that reflects the rich hues of sunset and carries the soft murmur of a nearby stream.”
• Furthermore, experiments with frame-based video generation and video-frame-based 3D AI models will adopt the same framework, testing whether the produced frames maintain coherent multisensory and spatial continuity.
3. Evaluation Metrics
• Visual coherence between generated content and the intact parts of the image.
• Multi-sensory consistency as inferred by evaluators—does the generated content evoke the intended sensory experience?
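The masking-and-regeneration step can be prototyped with an off-the-shelf inpainting pipeline. A sketch, assuming the Hugging Face diffusers library and the runwayml/stable-diffusion-inpainting checkpoint as a stand-in for whichever image-to-image model is under test; the file names and mask region are placeholders:

import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

# Erase a rectangular segment of the source image via a white mask;
# the pipeline regenerates only the masked region from the text hint.
image = Image.open("landscape.png").convert("RGB").resize((512, 512))
mask = Image.new("L", image.size, 0)
ImageDraw.Draw(mask).rectangle([200, 300, 450, 500], fill=255)

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

prompt = (
    "a gently rippling water surface that reflects the rich hues of "
    "sunset and suggests the soft murmur of a nearby stream"
)
result = pipe(prompt=prompt, image=image, mask_image=mask).images[0]
result.save("landscape_filled.png")  # compare against the intact parts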
C. EXPERIMENT 3: PREDICTION OF MULTI-SENSORY PERCEPTIONS IN SIMULATED 3D ENVIRONMENTS
1. Method Rationale
In simulated 3D environments, where vertices represent actual points in physical space, each point can be conceived as a sensor node capturing camera, microphone, and tactile data. By erasing select sensory channels, the experiment tests the AI model’s capacity to “fill in” the multisensory perception at that point, analogous to human intuition within the focus locus.
2. Experiment Design
• Each vertex in the 3D model will be temporarily converted into a virtual sensor—imaging, audio, and touch.
• A controlled loss will be induced in one or more sensory input channels (e.g., omitting the tactile feedback corresponding to rough surfaces).
• A composite prompt with hints about the overall qualitative multi-sensory experience of the space will guide the AI to predict the missing information.
• The experiment will integrate multiple image-to-image implementations for cross-validation of the model’s predictions.
3. Evaluation Metrics
• Accuracy of predicted sensory inputs versus known data from intact points.
• Consistency of spatial relationships and multisensory mappings across various vertices.
• Comparative studies in which human subjects assess the predicted versus expected sensory details.
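The vertex-as-sensor-node setup and its accuracy metric can be sketched end to end in a few lines. The channel names, the stand-in random data, and the mean-fill placeholder predictor below are our own scaffolding; a real run would route the erased channel through the AI model under test:

import numpy as np

CHANNELS = ["imaging", "audio", "touch"]  # virtual sensors per vertex

rng = np.random.default_rng(0)
n_vertices = 100

# Ground truth: one feature value per channel per vertex (stand-in data).
truth = {ch: rng.random(n_vertices) for ch in CHANNELS}

# Controlled loss: erase the touch channel at roughly 30% of vertices.
erased = rng.random(n_vertices) < 0.3
observed = {ch: v.copy() for ch, v in truth.items()}
observed["touch"][erased] = np.nan

def predict_touch(obs):
    """Placeholder predictor: fill erased touch values with the mean of
    the surviving ones. A real run would query the AI model here."""
    filled = obs["touch"].copy()
    filled[np.isnan(filled)] = np.nanmean(obs["touch"])
    return filled

# Evaluation: mean absolute error at erased vertices vs. intact truth.
pred = predict_touch(observed)
mae = np.abs(pred[erased] - truth["touch"][erased]).mean()
print(f"MAE over {erased.sum()} erased vertices: {mae:.3f}")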
Additional Considerations
• The experiments underscore two core parameters: focus and speed. The machine must judge when to provide a succinct response that leaves a “blank canvas” and when to elaborate based on user engagement, mirroring the human mind’s use of selective attention and memory retrieval. A simple prototype of this judgement appears after this list.
• The methodology also echoes the principle that creative users should be allowed to build intuitively. The AI’s role is to empower the unique creative process, akin to an artist who gradually reveals a masterpiece without overloading the canvas on the first stroke.
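One simple way to prototype the focus/speed trade-off is a small policy that maps an engagement signal to a response style. The signal, thresholds, and style names below are hypothetical choices for illustration, not empirical findings:

def choose_verbosity(engagement: float, prior_turns: int) -> str:
    """Map a normalised engagement signal (0..1) and conversation depth
    to a response style: leave a blank canvas first, elaborate later."""
    if prior_turns == 0 or engagement < 0.3:
        return "succinct"    # blank canvas: short, open-ended reply
    if engagement < 0.7:
        return "moderate"    # add a few sensory details
    return "elaborate"       # full multisensory description

print(choose_verbosity(engagement=0.8, prior_turns=3))  # -> elaborate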
CONCLUSION AND RECOMMENDATIONS
The experiments proposed illustrate an avenue toward developing AI systems that are not solely hyper-productive but also spatially aware in a human-like manner. By deliberately integrating imperfect, incomplete sensory data into their inputs, AI models may learn to predict omitted or “blank” portions in a coherent, multi-sensory, and contextually appropriate manner. Our experiments—involving large language models, image-to-image generation frameworks, and simulated 3D environment models—are designed to assess whether AI can replicate the human mind’s focus locus, where perfect incompleteness leads to enriched spatial awareness.
By teaching AI systems to accurately predict the multisensory attributes of our world, we might be able to build models that can genuinely experience and partake in our world as they serve us; machines might truly become partners that enhance our creative and everyday experiences.
It is recommended that further research be conducted along the following lines:
• Expanded user studies to precisely gauge the effectiveness of the predicted sensory additions in real-world scenarios.
• Integration with hardware devices (e.g., VR headsets, tactile sensors, eye trackers) to capture human responses and refine the multisensory experiential mappings.
• Longitudinal evaluations comparing the performance of proposed models to current state-of-the-art methods, ensuring that the focus remains on enhancing human-machine collaboration with minimal intrusion into the natural human condition.
Bbumba Emmanuel Ezekiel and Mukabire Joel
24th February 2025
REFERENCES
Carpo, M. (n.d.). Imitation, style and the eternal return of the precedent [Video]. John Hejduk Soundings Lecture, Harvard GSD. Retrieved from https://www.youtube.com/watch?v=BCfxP2N-rDs
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in large language models via reinforcement learning. arXiv preprint arXiv:2501.12948.
OpenAI. (n.d.). Learning to navigate in complex environments with curiosity-driven reinforcement learning. Retrieved from https://openai.com/research/pathmind/
Demers, C. M. H., & Potvin, A. (2017). Erosion in architecture: A tactile design process fostering biophilia. Architectural Science Review. Retrieved from https://www.tandfonline.com/doi/full/10.1080/00038628.2017.1336982