AirCopBench

A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

Paper · 📎 Appendix · Code · 🤗 Datasets

The benchmark is designed to evaluate whether vision-language models (VLMs) can process multi-UAV collaborative visual data for question answering, covering perception, reasoning, and decision-making in complex scenarios.

AirCopBench at a Glance

Overview of AirCopBench
Overview. Illustration of multi-drone collaborative perception under diverse degradations, alongside the performance of six representative MLLMs versus human and random baselines.
AirCopBench task taxonomy
Task Taxonomy. Four evaluation dimensions with fourteen task types covering scene understanding, object understanding, perception assessment, and collaborative decision making.
AirCopBench generation pipeline
Dataset Pipeline. Four-stage creation process including data collection, annotation, question generation, and quality control to ensure reliable benchmarks.
AirCopBench statistics
Statistical Overview. Distributions of VQA pairs across task types, data sources, UAV group counts, and perception degradation categories.
Task correlation matrix
Task Correlation. Correlation coefficients across all tasks highlight similarities in the cognitive abilities they demand from MLLMs.
Common MLLM reasoning errors
Common Errors. Typical reasoning failures observed in aerial collaborative perception scenarios for current MLLMs.

To better illustrate the proposed dataset, we present VQA examples along with the corresponding multi-view images captured by multiple UAVs in the real world (Liu et al. 2023), the EmbodiedCity simulator (Gao et al. 2024), the Coperception-UAV dataset (Hu et al. 2022), and the AeroCollab3D dataset (Tian et al. 2024), covering all 14 task types across 4 dimensions. Examples from each of the 4 data sources are shown in turn below.

[Images: UAV1 and UAV2 views]

Question Type: Scene Understanding - Scene Comparison

Question: What is a notable difference in the types of objects visible near the road in UAV1 compared to UAV2?

Options:

  • A: UAV1 shows more trees near the road, while UAV2 shows more buildings.
  • B: UAV1 shows a bus stop structure, which is absent in UAV2.
  • C: UAV2 shows a red ground parking lot with numerous cars, which is not visible in UAV1.
  • D: UAV1 includes a pedestrian crossing sign, missing in UAV2.

Answer: C

[Images: two UAV views]

Question Type: Object Understanding - Object Grounding

Question: Where is the large green landscaped area located relative to the multi-lane road?

Options:

  • A: The landscaped area is adjacent to the right side of the road.
  • B: The landscaped area is directly under the road.
  • C: The landscaped area is on the left side, separated by buildings.
  • D: The landscaped area is in the middle of the road.

Answer: C

[Images: two UAV views]

Question Type: Object Understanding - Object Recognition

Question: Which vehicle is positioned directly behind the yellow bus across the intersection?

Options:

  • A: The red car near the left edge of the image.
  • B: The white car approaching the intersection from the bottom.
  • C: The black car in the middle lane.
  • D: The teal car on the far right.

Answer: B

[Images: two UAV views]

Question Type: Object Understanding - Object Counting

Question: Based on the image analysis, how many targets (vehicles, pedestrians, bicycles) can be observed in UAV2's perspective?

Options:

  • A: 32
  • B: 35
  • C: 31
  • D: 34

Answer: D

[Images: two UAV views]

Question Type: Perception Assessment - Causal Assessment

Question: What main factor might affect object detection in the scene of UAV2?

Options:

  • A: High noise in the image.
  • B: Occlusion by large vehicles.
  • C: Low resolution due to distance.
  • D: Overexposure from bright lighting.

Answer: A

[Images: two UAV views]

Question Type: Perception Assessment - Usability Assessment

Question: Is the image captured by UAV1 usable for target perception tasks?

Options:

  • A: Yes, barely usable.
  • B: Yes, highly usable.
  • C: No, not usable.
  • D: Yes, partially usable.

Answer: B

[Images: two UAV views]

Question Type: Perception Assessment - Quality Assessment

Question: What is the perception quality assessment score (1-5) for the image captured by UAV1?

Options:

  • A: 5
  • B: 2
  • C: 1
  • D: 3

Answer: D

[Images: two UAV views]

Question Type: Collaborative Decision - When to Collaborate

Question: Should UAV1 collaborate with another UAV to address incomplete information?

Options:

  • A: Yes, due to partial occlusion of key objects.
  • B: No, the scene is fully visible.
  • C: Yes, due to poor visibility of the objects.
  • D: No, all objects are clearly captured.

Answer: A

[Images: two UAV views]

Question Type: Scene Understanding - Scene Description

Question: How do the vehicles interact with the pedestrian crossing in this nighttime scene?

Options:

  • A: Vehicles are stopped before the crossing, indicating respect for pedestrian safety.
  • B: Vehicles are actively passing over the crossing, suggesting no pedestrians are present.
  • C: Vehicles are parked directly on the crossing, obstructing any pedestrian movement.
  • D: Vehicles are turning around the crossing, avoiding direct interaction.

Answer: B

[Images: two UAV views]

Question Type: Object Understanding - Object Matching

Question: Which object in UAV2 corresponds to the pedestrian crossing the road near the center of UAV1's view?

Options:

  • A: The pedestrian now walking along the sidewalk in UAV2.
  • B: The cyclist riding near the road edge in UAV2.
  • C: The pedestrian standing near a parked car in UAV2.
  • D: The pedestrian crossing the road near a traffic light in UAV2.

Answer: A

[Images: two UAV views]

Question Type: Collaborative Decision - Why to Collaborate

Question: Why should UAV1 collaborate with another UAV?

Options:

  • A: To enhance the visibility of distant or small targets that are not visible in its current perspective.
  • B: To capture a wider geographical area for broader context.
  • C: To adjust for varying weather conditions affecting image clarity.
  • D: To provide real-time data for immediate decision-making.

Answer: A

[Images: three UAV views]

Question Type: Scene Understanding - Scene Description

Question: What is the most likely primary monitoring priority for the UAV in this scene?

Options:

  • A: Tracking the movement of a drone around the structure.
  • B: Monitoring the traffic flow on the adjacent road.
  • C: Observing the pedestrian activity on the sidewalks.
  • D: Identifying bicycles traveling near the intersection.

Answer: A

[Images: three UAV views]

Question Type: Perception Assessment - Causal Assessment

Question: What is the primary cause affecting the detection of vehicles, pedestrians, and bicycles in the image of UAV1?

Options:

  • A: Overexposure due to sunlight causing glare.
  • B: Color blending between targets and surroundings.
  • C: Occlusion by dense foliage obstructing visibility.
  • D: Motion blur caused by moving targets.

Answer: C

[Images: three UAV views]

Question Type: Collaborative Decision - Why to Collaborate

Question: What is the main reason UAV2 should collaborate with other UAVs in this scenario?

Options:

  • A: To obtain a larger and clearer view of the target that appears too small or distant.
  • B: To compensate for shadowing effects in the image caused by structural elements.
  • C: To reduce time required for multi-perspective object labeling.
  • D: To synchronize flight paths for uniform coverage of the area.

Answer: A

[Images: three UAV views]

Question Type: Object Understanding - Object Matching

Question: The hovering drone visible near the center of the intersection in UAV3's perspective corresponds to which target in another UAV's aerial view?

Options:

  • A: The drone partially obscured by a tree, hovering along the sidewalk in UAV2.
  • B: The drone seen flying above the treetops, captured clearly in UAV1's perspective.
  • C: The drone stationary and seen near a traffic light on the corner of the street in UAV2.
  • D: The drone flying low above an empty crosswalk, visible near a parked vehicle in UAV1.

Answer: A

[Images: three UAV views]

Question Type: Object Understanding - Object Counting

Question: From UAV1's aerial perspective, how many targets (drones, vehicles, pedestrians) can be detected in UAV1's field of view?

Options:

  • A: 2
  • B: 0
  • C: 1
  • D: 3

Answer: D

[Images: three UAV views]

Question Type: Collaborative Decision - Who to Collaborate

Question: Which UAV should UAV2 collaborate with as a partner providing a complementary perspective in the multi-UAV setup?

Options:

  • A: UAV1
  • B: None (no suitable collaboration partner).
  • C: UAV3
  • D: A ground-based sensor.

Answer: A

[Images: three UAV views]

Question Type: Scene Understanding - Scene Comparison

Question: How does the visibility of target drones differ across the three UAV perspectives?

Options:

  • A: UAV1 captures one visible drone but misses the others due to occlusion from buildings.
  • B: UAV2 provides visibility of two drones with limited occlusion from structural curves.
  • C: UAV3 captures the highest number of drones visible due to its extended angle of view.
  • D: All three UAVs detect an equal number of drones in the scene.

Answer: B

[Images: three UAV views]

Question Type: Scene Understanding - Scene Description

Question: Based on UAV2's field of view, what is the primary monitoring priority of target movement patterns?

Options:

  • A: Tracking the movement of the drones hovering above the buildings.
  • B: Analyzing pedestrian flow along the sidewalks and open areas.
  • C: Monitoring vehicles traveling along the central road for traffic violations.
  • D: Observing bicycle movements through designated bike lanes.

Answer: A

[Images: three UAV views]

Question Type: Collaborative Decision - What to Collaborate

Question: What specific object information should UAV2 share with other UAVs about the drone in the marked region?

Options:

  • A: Drone partially obscured by tree leaves in the center of the image that needs position clarification.
  • B: Drone moving near the road crossing in the upper center area that requires trajectory estimation.
  • C: Drone blending with shadows near the bottom edge that requires contrast adjustment.
  • D: Drone hovering above the road near the pedestrian area on the left side that requires height confirmation.

Answer: A

[Images: five UAV views]

Question Type: Scene Understanding - Scene Comparison

Question: Which perspective highlights the presence of buildings near the water, and how do they compare to the other images?

Options:

  • A: Only UAV5 shows buildings near the water.
  • B: UAV3 and UAV5 both show buildings, but UAV5 includes water.
  • C: UAV1 and UAV5 show buildings near the road and water.
  • D: None of the UAV perspectives include buildings near the water.

Answer: A


Question Type: Perception Assessment - Causal Assessment

Question: What is the primary cause of object detection challenges in this UAV3 image?

Options:

  • A: High image noise from sensor failure.
  • B: Occlusion by dense tree placement.
  • C: Motion blur due to UAV movement.
  • D: Overexposure due to excessive sunlight.

Answer: B


Question Type: Object Understanding - Object Grounding

Question: What is the position of the red car relative to the large cluster of trees in the middle of the image?

Options:

  • A: The red car is directly behind the cluster of trees on the curved road.
  • B: The red car is on the curved road to the right of the cluster of trees.
  • C: The red car is located at the far edge of the open grassy area.
  • D: The red car is positioned within the cluster of trees near the largest rock.

Answer: B

[Images: five UAV views]

Question Type: Scene Understanding - Observing Posture

Question: Which UAV perspective better highlights the spatial relationship between the circular building and the main road?

Options:

  • A: UAV1 offers a clearer view due to its higher angle and proximity to the circular building.
  • B: UAV2 provides a clearer view by focusing on the main road and surrounding structures.
  • C: Both UAVs equally highlight the spatial relationship.
  • D: Neither UAV perspective clearly shows the relationship between the building and the road.

Answer: A


Question Type: Collaborative Decision - Who to Collaborate

Question: Which UAV should UAV2 collaborate with as a partner providing a complementary perspective in the multi-UAV setup?

Options:

  • A: UAV1.
  • B: UAV3.
  • C: UAV4.
  • D: UAV5.
  • E: None (no need for collaboration).

Answer: D


Question Type: Collaborative Decision - What to Collaborate

Question: What specific object information should UAV3 share with other UAVs to improve multi-view perception of the scene?

Options:

  • A: Precise positions and movements of the blue car on the main road visible in UAV3's view.
  • B: Architectural details of the silo structures for better object recognition in other UAVs.
  • C: Tree locations along the road to align environmental features across all UAV views.
  • D: Details of vehicle congestion near the circular building visible in UAV2's angle.

Answer: A

[Images: five UAV views]

Question Type: Object Understanding - Object Counting

Question: Based on the image analysis, how many targets (vehicles, pedestrians, bicycles) can be observed in UAV2's perspective?

Options:

  • A: 5.
  • B: 11.
  • C: 9.
  • D: 6.

Answer: C


Question Type: Perception Assessment - Causal Assessment

Question: What is the primary factor affecting perception quality in UAV5's view?

Options:

  • A: Overexposure due to direct sunlight.
  • B: Partial occlusion by foreground tree branches.
  • C: Blur caused by UAV motion.
  • D: Lower resolution due to sensor limitations.

Answer: B


Question Type: Collaborative Decision - Why to Collaborate

Question: What is the main reason UAV2 should collaborate with other UAVs in this scenario?

Options:

  • A: To synchronize overlapping data from identical regions.
  • B: To improve visibility of shadowed objects in UAV2's perspective.
  • C: To generate a shared panoramic image across multiple perspectives.
  • D: To compensate for the small and distant target size visible in UAV2.

Answer: D


Question Type: Collaborative Decision - When to Collaborate

Question: Should UAV4 collaborate with other UAVs to address environmental factors affecting perception across UAVs?

Options:

  • A: Yes, due to low image resolution.
  • B: Yes, due to shadow.
  • C: No, the environment is clear.

Answer: C

đź“‹ Quality Control Process

Dataset Quality Control Measures

The quality control of the dataset involves three main measures:

1. Standard Examination

It evaluates VQA pairs against four criteria:

  • Required Content: checks whether all essential information needed to address the question is included; incomplete pairs are flagged for revision.
  • Format Consistency: ensures uniform structure, wording, and presentation across pairs, identifying those deviating from the standard for correction.
  • Answer Validity: verifies the accuracy and relevance of answer options, filtering out pairs with incorrect or irrelevant ones.
  • Question Length: assesses whether questions are detailed enough to avoid ambiguity, ensuring comprehensibility.

Each criterion is worth 1 point (maximum 4), and only pairs scoring 4 are retained in the final dataset.

2. Blind Filtering

This step uses 3 MLLMs to eliminate questions answerable by common sense alone, without visual input. Questions answered correctly by all MLLMs (solvable from general knowledge) or incorrectly by all (likely flawed or ambiguous) are removed. Only questions with mixed outcomes, which require genuine multi-view visual reasoning, are retained.
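
As a concrete illustration, here is a minimal sketch of these two automated screening steps in Python. All names (the criterion checkers and the MLLM interface) are hypothetical placeholders, not the actual implementation:

def passes_standard_examination(pair, checkers) -> bool:
    # Four binary criteria, 1 point each; only 4/4 pairs are retained.
    criteria = (checkers.required_content, checkers.format_consistency,
                checkers.answer_validity, checkers.question_length)
    return sum(int(check(pair)) for check in criteria) == 4

def keep_after_blind_filtering(pair, mllms) -> bool:
    # Ask each MLLM the question WITHOUT providing the images.
    answers = [m.answer(pair["question"], pair["options"]) for m in mllms]
    correct = [a == pair["answer"] for a in answers]
    if all(correct):       # answerable by common sense alone: drop
        return False
    if not any(correct):   # likely flawed or ambiguous: drop
        return False
    return True            # mixed outcomes need visual reasoning: keep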

3. Human Refinement

It addresses remaining issues in VQA pairs, including Ambiguous Questions (lacking clear definitions), Invalid Options (no correct answer, duplicates, irrelevant, or indistinct options), and Incorrect Answers (wrong, missing, or multiple correct answers). Refinement examples are provided in Table 2.

Table 2: Examples of human refinement for generated multi-UAV perception questions

Issue: Incorrect answers. Refinement: correct counting errors in target detection.

Object Counting: Based on the image analysis, how many targets (vehicles, pedestrians, bicycles) can be observed in UAV2's perspective?
Choices: A. 21   B. 20   C. 18   D. 19
Original Answer: A → Corrected Answer: D

Issue: Ambiguous question. Refinement: add specific object identifiers to eliminate ambiguity.

Object Grounding: Where is the gray car located relative to the blue car which is adjacent to it in this scene?
Choices:
A. The gray car is ahead of the blue car in the same lane
B. The gray car is behind the blue car but in a different lane
C. The gray car is adjacent to the blue car in the neighboring lane
D. The gray car is directly in front of the blue car in a parallel lane
Answer: C

Issue: Invalid options. Refinement: replace unmatchable options with valid alternatives.

Object Matching: Which object in another UAV's view corresponds to the yellow vehicle in the left lane of the highway observed in UAV3's view?
Choices:
A. The yellow vehicle now seen merging onto a curved road in UAV2's view
B. The yellow vehicle now stationary behind a row of parked cars in UAV4's view
C. The yellow vehicle seen traveling in the middle lane, heading toward an underpass in UAV5's view
D. The yellow vehicle now seen parked near the trees on the right side of the road in UAV1's view (no drone perspective can observe this)
Answer: D

Issue: Invalid options. Refinement: correct object type mismatches in options.

Object Matching: Which object in another UAV's view corresponds to the dark gray sedan traveling in the middle lane in UAV3's view?
Choices:
A. The dark gray sedan now seen from a closer view in UAV2, traveling in the left lane
B. The dark gray SUV (corrected from "black SUV") now seen from above in UAV1, parked near the intersection
C. The dark gray sedan now seen from the side in UAV4, approaching a group of parked cars
D. The silver sedan now seen from the rear in UAV5, moving through a residential area
Answer: C

📊 Evaluated Baselines and Models

We evaluate both proprietary and open-source Multimodal Large Language Models (MLLMs) trained to handle multi-image inputs.

⚙️ Hyperparameters for Training


In this section, we present all the hyperparameters used to train the two kinds of models in Table 4 and Table 5. All training runs were conducted with LLaMAFactory (Zheng et al. 2024). Regarding image resolution and the number of image tokens, we adhere to the original settings specified by each model.

Table 4: Hyperparameters for training the Qwen2.5-VL (7B and 3B) models.

LoRA Rank: 8
LoRA α: 16
LoRA Dropout: 0.1
LoRA Target: all
GPU: 4 × NVIDIA A800
Batch Size: 1
Gradient Accumulation Steps: 8
Warmup Ratio: 0.1
Learning Rate: 1e-4
Learning Rate Scheduler: Cosine
Unfreeze Vision Tower: True

Table 5: Hyperparameters for training the LLaVA-NeXT-13B model.

LoRA Rank: 8
LoRA α: 16
LoRA Dropout: 0.1
LoRA Target: all
GPU: 4 × NVIDIA A800
Batch Size: 1
Gradient Accumulation Steps: 8
Warmup Ratio: 0.1
Learning Rate: 1e-5
Learning Rate Scheduler: Cosine
Unfreeze Vision Tower: False
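
For reference, the two recipes can be summarized in code. This is a minimal sketch with hypothetical key names; the exact field names in a LLaMAFactory config file differ, so consult its documentation when reproducing:

# Values taken from Tables 4 and 5; key names are illustrative only.
COMMON = {
    "finetuning_type": "lora",
    "lora_rank": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "lora_target": "all",
    "gpus": "4 x NVIDIA A800",
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "warmup_ratio": 0.1,
    "lr_scheduler": "cosine",
}
QWEN25_VL = {**COMMON, "learning_rate": 1e-4, "unfreeze_vision_tower": True}
LLAVA_NEXT_13B = {**COMMON, "learning_rate": 1e-5, "unfreeze_vision_tower": False}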

đź“‹ Details of Question Generation

To produce high-quality VQA pairs, we employ three approaches: model-based, rule-based, and human-based generation, which together ensure the validity of our dataset.

Model-based Generation

Model-based generation uses LLMs to create questions for task types that require high diversity and contextual richness. Specifically, we apply four prompting strategies to guide the model toward more relevant and coherent questions; a combined sketch of these strategies follows the list below. The prompting examples and templates for model-based VQA generation are shown in E.4.

• Task Decomposition

We begin by decomposing the overall VQA generation task, which encompasses all 14 collaborative perception tasks, into sub-tasks based on the complexity, requirements, and cognitive demands of each task type. Instead of using a single VQA generation prompt for all tasks, we create a specific function for each task, call them sequentially, and merge their results into the final data structure. This approach raises the model's success rate, simplifies debugging and retries, and improves the quality of generated questions by letting the model focus on each task individually.

• Role-playing

To improve task efficiency and reduce redundancy, we suggest incorporating global rules into the system role's content. This allows the user prompt to focus solely on the specific task at hand. For example, by specifying the system's role as an expert assistant and setting rules like "output must be in JSON," these constraints are applied globally, leaving the user prompt simpler and more concise. This approach not only separates the responsibilities of the system and user prompts but also reduces token consumption by eliminating repetitive instructions in every user request.

• CoT Prompting

For tasks requiring deep visual understanding, such as Scene Description, Scene Comparison, and Causal Assessment, we adopt a two-step CoT Prompting strategy that combines model pre-processing and generation. First, the model generates an intermediate understanding of the images, such as textual descriptions or captions for each UAV's perspective. In the second step, on the basis of these descriptions, the model generates the final question and answer pair that better fits the image content and question type. This approach simplifies the complex "image-to-question" task into two more manageable steps: "image-to-text" and "text-to-question." By generating captions or reasoning first, followed by question creation, this method improves both task manageability and question quality.

• Few-shot Learning

For tasks requiring complex reasoning, such as Object Grounding or Why to Collaborate, incorporating a few-shot learning approach in the prompt can significantly improve model performance. By providing a structured example in a few-shot format, the model learns to generate responses in the desired JSON format and question style. This approach enforces stricter format constraints and guides the model to focus on relevant details, ensuring more accurate and contextually appropriate questions.
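
To make the four strategies concrete, here is a minimal, self-contained sketch of how they can be combined, assuming an OpenAI-compatible chat API. The model name, prompt wording, and helper names (ask, generate_scene_comparison, GENERATORS) are illustrative assumptions rather than the exact code behind AirCopBench:

import json
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name

# Role-playing: global rules live in the system role, so each user
# prompt stays short and the JSON-only constraint applies everywhere.
SYSTEM = ("You are an expert teacher of the 'Multi-view Perception' course. "
          "Always respond in English and output valid JSON only.")

def ask(user_prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user_prompt}],
    )
    return resp.choices[0].message.content

def generate_scene_comparison(captions: list[str], fewshot: str) -> dict:
    # CoT prompting, step 2: per-view captions produced by a separate
    # "image-to-text" call are turned into a question ("text-to-question").
    # Few-shot learning: one structured example pins down format and style.
    prompt = (f"Example:\n{fewshot}\n\nUAV view descriptions:\n"
              + "\n".join(captions)
              + "\n\nWrite one Scene Comparison multiple-choice question "
                "with options A-D and a single correct answer, as JSON.")
    return json.loads(ask(prompt))

# Task decomposition: one dedicated generator per task type, called
# sequentially; results are merged into the final dataset structure.
GENERATORS = {"scene_comparison": generate_scene_comparison}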

Prompt Design

Our prompt design follows a hierarchical structure with main instructions and specialized templates for different question types. The main instruction prompt (Fig. 6) establishes the fundamental framework, defining the role as an expert teacher of multi-view perception and specifying the core requirements for multiple-choice question generation, including the JSON output format and answer structure.

MAIN INSTRUCTIONS

Role: You are an expert teacher of the "Multi-view Perception" course, tasked with creating high-quality multiple-choice questions that test students' understanding of multi-UAV collaboration.

Goal: To generate multiple-choice questions about collaborative decision-making, object understanding, perception assessment, and scene understanding from multi-UAV visual content, adhering to specific requirements.

Content Restrictions: Each question must be strictly restricted to and strongly related to the provided visual content and annotation data.

Question Structure:

• Each question needs the question itself and 4 choices (A, B, C, D)
• There must be only 1 CORRECT answer and 3 wrong answers
• Plausible but Incorrect: The incorrect choices should be reasonable but factually wrong
• The wrong answers should not be too irrelevant
• Answer Placement: The correct answer can be placed at any position among the choices (A, B, C, D)
• Output Format: Must be valid JSON format

Collaborative Decision

When to Collaborate

TASK EXPLANATION: This type of question requires the student to judge when collaboration between multiple UAVs is necessary based on the current scene analysis. The proper answer should identify situations where information is incomplete, targets are occluded, or environmental factors require multi-UAV coordination.

TEMPLATE Question: "When should [UAV ID] initiate collaboration with other UAVs based on the current scene analysis?"

TEMPLATE Choices: "When [specific collaboration trigger condition]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' understanding of when collaboration between multiple UAVs (up to 3) is necessary."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Follow a structured thinking process: analyze annotation → determine collaboration need → formulate question → create options → verify correctness
3. Questions must be based on annotation data
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only

THINKING PROCESS:

1. Analyze the annotation to determine if collaboration is needed
2. Formulate a clear question about the need for collaboration
3. Create 4 distinct options where only one is correct
4. Verify the question is unambiguous and answerable

EXAMPLE OUTPUT:

{
  "question_id": "sim3_when2col_UAV1_1001",
  "question_type": "4.1 When to Collaborate (UAV1)",
  "question": "When should UAV1 initiate collaboration with other UAVs based on the current scene analysis?",
  "options": {
    "A": "When target objects are partially occluded and require multi-viewpoint verification",
    "B": "When the scene is completely clear and all targets are visible",
    "C": "When there are no moving objects in the field of view",
    "D": "When the weather conditions are optimal for single UAV operation"
  },
  "correct_answer": "A",
  "image_description": "UAV1 shows a drone partially occluded by a tree, requiring collaboration for complete target verification"
}

Collaborative Decision

What to Collaborate

TASK EXPLANATION: This type of question requires the student to identify what specific object information should be shared between multiple UAVs. The proper answer should focus on specific object descriptions with intuitive location and context details, prioritizing drone detection and tracking as primary targets.

TEMPLATE Question: "What specific object information should [UAV ID] share with other UAVs about the [target type] in the marked region?"

TEMPLATE Choices: "[Target type] [specific condition] in the [location] that needs [information type]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' understanding of what specific object information should be shared between multiple UAVs (up to 3)."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Follow a structured thinking process: analyze images → identify specific object information gaps → formulate question → create options → verify correctness
3. Questions must be based on actual visual content or provided descriptions
4. Each question should have exactly 4 options (A, B, C, D) with at least one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only
7. Focus on specific object descriptions with intuitive location and context details
8. Use intuitive image locations (upper-left corner, center, near landmarks, etc.) instead of numerical positions
9. Prioritize drone detection and tracking as the primary target in all questions

THINKING PROCESS:

1. Analyze all images to identify specific object information gaps in marked regions across multiple UAV views
2. Focus on drone-related object information as the primary target
3. Identify the focus based on generation index
4. Formulate a clear question about what specific object information to share from marked regions
5. Create 4 distinct options, all related to specific object descriptions with intuitive location and context
6. Verify the question is unambiguous and answerable

EXAMPLE OUTPUT:

{
  "question_id": "sim3_what2col_UAV1_1001",
  "question_type": "4.2 What to Collaborate (UAV1)",
  "question": "What specific object information should UAV1 share with other UAVs about the drone in the marked region?",
  "options": {
    "A": "Drone occluded by the tree in the upper-left corner of the image that needs position clarification",
    "B": "Drone flying at low altitude near the bottom edge that requires height verification",
    "C": "Drone moving rapidly from left to right across the center that needs trajectory prediction",
    "D": "Drone with similar color to background near the traffic light that requires contrast enhancement"
  },
  "correct_answer": "A",
  "image_description": "UAV1 shows a drone in the marked region at position (31.5%, 48.1%) with size 6.3%×3.6% that is occluded by a tree in the upper-left corner, requiring detailed position and movement information."
}

Collaborative Decision

Which to Collaborate

TASK EXPLANATION: This type of question requires the student to determine which UAV(s) should be the optimal collaboration partner in a multi-UAV setup. The proper answer should consider complementary visibility conditions and the relative strengths or specific needs of the current scenario.

TEMPLATE Question: "Which UAV should [UAV ID] collaborate with to [specific collaboration goal]?"

TEMPLATE Choices: "[UAV ID] which [specific advantage or capability]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' understanding of which UAV(s) should be the collaboration partner in a multi-UAV system (up to 3 UAVs)."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Follow a structured thinking process: analyze annotation → determine collaboration partner → formulate question → create options → verify correctness
3. Questions must be based on annotation data
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only

THINKING PROCESS:

1. Analyze the annotation to identify the collaboration partners
2. Formulate a clear question about the collaboration partner
3. Create 4 distinct options where only one is correct
4. Verify the question is unambiguous and answerable

EXAMPLE OUTPUT:

{
  "question_id": "sim3_which2col_UAV1_1001",
  "question_type": "4.3 Which to Collaborate (UAV1)",
  "question": "Which UAV should UAV1 collaborate with to get a better viewing angle of the partially occluded target in the central area?",
  "options": {
    "A": "UAV2, which has a clear view of the central area from its positioning",
    "B": "UAV3, which is located at a similar angle with the same viewing obstruction",
    "C": "No collaboration needed as the target is fully visible",
    "D": "All UAVs simultaneously for maximum coverage"
  },
  "correct_answer": "A",
  "image_description": "UAV1 has partially occluded view of central area target, UAV2 has better positioning with clear view of target"
}

Collaborative Decision

Why to Collaborate

TASK EXPLANATION: This type of question requires the student to analyze the fundamental reasons and motivations for collaboration between multiple UAVs. The proper answer should explain the motivation behind the collaboration decision and evaluate the specific benefits brought by collaboration.

TEMPLATE Question: "Why is collaboration necessary between [UAV ID] and other UAVs in this scenario?"

TEMPLATE Choices: "To [specific collaboration benefit or reason]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' understanding of why collaboration between multiple UAVs (up to 3) is necessary."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Follow a structured thinking process: analyze annotation → determine collaboration rationale → formulate question → create options → verify correctness
3. Questions must be based on annotation data
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only

THINKING PROCESS:

1. Analyze the annotation to identify the collaboration rationale
2. Formulate a clear question about why collaboration is needed
3. Create 4 distinct options where only one is correct
4. Verify the question is unambiguous and answerable

EXAMPLE OUTPUT:

{
  "question_id": "sim3_why2col_UAV1_1001",
  "question_type": "4.4 Why to Collaborate (UAV1)",
  "question": "Why is collaboration necessary between UAV1 and other UAVs in this scenario?",
  "options": {
    "A": "To overcome visual occlusion caused by environmental obstacles and improve target detection accuracy",
    "B": "To reduce battery consumption by distributing the workload",
    "C": "To increase flight speed and cover more ground area",
    "D": "To test communication systems between UAVs"
  },
  "correct_answer": "A",
  "image_description": "UAV1 encounters visual occlusion of key targets due to environmental objects, requiring collaborative input from other UAVs to maintain complete situational awareness"
}

Object Understanding

Object Recognition

TASK EXPLANATION: This type of question requires the student to identify targets from UAV perspectives, focusing specifically on drone detection, vehicle recognition, and pedestrian identification. The proper answer should emphasize the UAV perspective and aerial view characteristics.

TEMPLATE Question: "From the UAV's aerial perspective, what type of target is [specific characteristic] in this scene?"

TEMPLATE Choices: "[Specific target description] from [UAV perspective characteristic]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'UAV Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' ability to identify targets from UAV perspectives, focusing specifically on drone detection, vehicle recognition, and pedestrian identification."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Focus on UAV-specific target perception: drones, vehicles, pedestrians
3. Questions must emphasize the UAV perspective and aerial view characteristics
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only

THINKING PROCESS:

1. First, describe the key targets visible from the UAV perspective
2. Focus on UAV-specific target types: drones, vehicles, pedestrians
3. Formulate a clear, specific question about target recognition from aerial view
4. Create 4 distinct options where only one is correct
5. Verify the question emphasizes UAV perspective and target perception

EXAMPLE OUTPUT:

{
  "question_id": "sim3_OR_UAV1_1001",
  "question_type": "2.1 Object Recognition (UAV1)",
  "question": "From the UAV's aerial perspective, what type of target is most prominently visible in this scene?",
  "options": {
    "A": "A white delivery van",
    "B": "A surveillance drone",
    "C": "A pedestrian crossing the road",
    "D": "A stationary traffic light"
  },
  "correct_answer": "A",
  "image_description": "The UAV captures a white delivery van from above, clearly visible on the multi-lane road with other vehicles nearby."
}

Object Understanding

Object Counting

TASK EXPLANATION: This type of question requires the student to count specific target types in scenes from UAV perspectives. The proper answer should be based on annotation data and ensure counting accuracy for drones, vehicles, pedestrians, and bicycles.

TEMPLATE Question: "From the UAV's aerial perspective, how many [target type] can be detected in [UAV ID]'s field of view?"

TEMPLATE Choices: "[Number] [target type]"

CORE PROMPT STRUCTURE:

[Rule-Based] Generate UAV target counting questions based on all_samples.json annotation data. Questions are generated even if count = 0.

EXAMPLE OUTPUT:

{
  "question_id": "sim3_OC_UAV1_1001",
  "question_type": "2.2 UAV Target Counting (UAV1)",
  "question": "From the UAV's aerial perspective, how many targets (drones, vehicles, pedestrians) can be detected in UAV1's field of view?",
  "options": {
    "A": "3",
    "B": "4",
    "C": "5",
    "D": "6"
  },
  "correct_answer": "A",
  "source": "Rule-Based from all_samples.json"
}
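
Because Object Counting is rule-based, answers come directly from annotation counts. Below is a minimal sketch of such a generator, with hypothetical argument and field names mirroring the example output above:

import random

def counting_question(uav_id: str, count: int, qid: str) -> dict:
    # Three distinct distractors near the true count, never negative.
    candidates = [count + d for d in (1, 2, 3, -1, -2) if count + d >= 0]
    distractors = [c for c in candidates if c != count][:3]
    values = [count] + distractors
    random.shuffle(values)  # vary which letter holds the correct answer
    letters = ["A", "B", "C", "D"]
    options = dict(zip(letters, map(str, values)))
    correct = letters[values.index(count)]
    return {
        "question_id": qid,
        "question_type": f"2.2 UAV Target Counting ({uav_id})",
        "question": (f"From the UAV's aerial perspective, how many targets "
                     f"(drones, vehicles, pedestrians) can be detected in "
                     f"{uav_id}'s field of view?"),
        "options": options,
        "correct_answer": correct,
        "source": "Rule-Based from all_samples.json",
    }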

Object Understanding

Object Grounding

TASK EXPLANATION: This type of question requires the student to understand spatial positions of targets in scenes from UAV perspectives. The proper answer should analyze relative positional relationships between targets and evaluate spatial perception capabilities.

TEMPLATE Question: "Where is the [target type] located relative to [other objects] in [UAV ID]'s field of view?"

TEMPLATE Choices: "[Target type] [spatial relationship] [reference objects]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'UAV Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' understanding of target spatial positioning from UAV aerial perspectives."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Focus on UAV-specific target positioning: drones, vehicles, pedestrians from aerial view
3. Questions must emphasize the UAV's spatial perception capabilities
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only

THINKING PROCESS:

1. First, describe the key targets and their spatial positions from the UAV perspective
2. Focus on UAV-specific target types: drones, vehicles, pedestrians
3. Formulate a clear, specific question about target grounding from aerial view
4. Create 4 distinct options where only one is correct
5. Verify the question emphasizes UAV spatial perception

EXAMPLE OUTPUT:

{
  "question_id": "sim3_OG_UAV1_1001",
  "question_type": "2.3 Object Grounding (UAV1)",
  "question": "Where is the drone located relative to other objects in UAV1's field of view?",
  "options": {
    "A": "Above the intersection, hovering near the traffic light",
    "B": "Behind the building, partially obscured from view",
    "C": "On the ground near the sidewalk",
    "D": "Inside the vehicle on the road"
  },
  "correct_answer": "A",
  "image_description": "The drone is positioned above the intersection, hovering near the traffic light structure."
}

Object Understanding

Object Matching

TASK EXPLANATION: This type of question requires the student to match identical targets across multi-UAV perspectives, analyzing the impact of viewpoint changes on target appearance. The proper answer should focus on appearance differences caused by viewpoint changes rather than simple recognition.

TEMPLATE Question: "The [target description] seen from [perspective1] in [UAV1]'s view appears as what in [UAV2]'s perspective?"

TEMPLATE Choices: "[Target description] seen from [perspective2] with [specific changes]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'UAV Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' ability to match targets across multiple UAV perspectives, focusing on drone detection, vehicle tracking, and pedestrian identification."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Focus on UAV-specific target matching: drones, vehicles, pedestrians across aerial views
3. Questions must emphasize the UAV's multi-perspective target tracking capabilities
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only

A simple question like "What is the white truck in image 1?" with the answer "The white truck in image 2" is USELESS. Avoid this. Instead, follow this reasoning process to create a high-quality question:

THINKING PROCESS:

1. Identify a Candidate Target: In the first image (uav id), find a distinct target (drone, vehicle, pedestrian) that is also clearly visible in one of the subsequent images (other UAVs). Let's call this the "target object".
2. Analyze the Change: Critically compare the target object's appearance and context between the UAV views. Focus on what has CHANGED. Examples of changes include:

• Perspective: "The vehicle seen from the side" in [uav id] is now "seen from the rear" in another UAV view.
• Relative Position: "The car behind the bus" in [uav id] is now "the car beside a red sedan" in another UAV view.
• Action/State: "The pedestrian walking towards the crosswalk" in [uav id] is now "the pedestrian waiting at the crosswalk" in another UAV view.
• Occlusion: "The partially occluded blue car" in [uav id] is now "fully visible" in another UAV view.

3. Formulate Question: Ask about the target object's appearance or context in the first image, with the answer being how it appears in the second image.
4. Create Options: Make all 4 options plausible descriptions of the target object in the second image, with only one being correct.
5. Verify: Ensure the question tests understanding of perspective changes, not just object recognition.

EXAMPLE OUTPUT:

{
  "question_id": "sim3_OM_UAV1_1001",
  "question_type": "2.4 Object Matching (UAV1)",
  "question": "The red car seen from the side in UAV1's view appears as what in UAV2's perspective?",
  "options": {
    "A": "A red car seen from the rear with visible taillights",
    "B": "A blue car seen from the front",
    "C": "A red car seen from above with roof visible",
    "D": "A red car seen from the opposite side"
  },
  "correct_answer": "A",
  "image_description": "UAV1 shows a red car from the side, while UAV2 shows the same car from the rear with visible taillights."
}

Perception Assessment

Quality Assessment

TASK EXPLANATION: This type of question requires the student to assess image quality for perception tasks in multi-UAV views, with focus on drone, vehicle, pedestrian, and bicycle detection. The proper answer should evaluate factors such as clarity, noise, and distortion that affect target detection.

TEMPLATE Question: "How would you rate the [quality factor] for detecting [target types] in this scene?"

TEMPLATE Choices: "[Quality level] with [specific characteristics]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' ability to assess image quality for perception tasks in multi-UAV views, with focus on drone, vehicle, pedestrian, and bicycle detection."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Follow a structured thinking process: analyze → identify quality factors → formulate question → create options → verify correctness
3. Questions must be based on actual visual content or provided description
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only
7. Focus on quality factors that affect detection of drones, vehicles, pedestrians, and bicycles

THINKING PROCESS:

1. First, describe the quality factors (clarity, noise, color balance, etc.) in the image or description
2. Identify the focus based on generation index
3. Formulate a clear, specific question about image quality for target detection
4. Create 4 distinct options where only one is correct
5. Verify the question is unambiguous and answerable

EXAMPLE OUTPUT:

{
  "question_id": "sim3_QA_UAV1_1001",
  "question_type": "3.1 Quality Assessment (UAV1)",
  "question": "How would you rate the image clarity for detecting drones and vehicles in this scene?",
  "options": {
    "A": "Excellent with sharp details on all targets",
    "B": "Good with minor blur on some objects",
    "C": "Fair with noticeable distortion affecting detection",
    "D": "Poor with significant artifacts obscuring targets"
  },
  "correct_answer": "A",
  "image_description": "The image shows excellent clarity with sharp details on drones, vehicles, pedestrians, and bicycles."
}

Perception Assessment

Usability Assessment

TASK EXPLANATION: This type of question requires the student to assess image usability for perception tasks in multi-UAV views, with focus on drone, vehicle, pedestrian, and bicycle detection and tracking. The proper answer should evaluate whether images are suitable for specific tasks and consider matching between task requirements and image characteristics.

TEMPLATE Question: "Is the image captured by [UAV ID] usable for [specific task]?"

TEMPLATE Choices: "[Usability level] for [specific reason]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' ability to assess image usability for perception tasks in multi-UAV views, with focus on drone, vehicle, pedestrian, and bicycle detection and tracking."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Follow a structured thinking process: analyze → identify usability factors → formulate question → create options → verify correctness
3. Questions must be based on actual visual content or provided description
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only
7. Focus on usability factors that affect detection and tracking of drones, vehicles, pedestrians, and bicycles

THINKING PROCESS:

1. First, describe the usability factors (suitability for target detection, tracking, etc.) in the image or description
2. Identify the focus based on generation index
3. Formulate a clear, specific question about image usability for target tasks
4. Create 4 distinct options where only one is correct
5. Verify the question is unambiguous and answerable

EXAMPLE OUTPUT:

{
  "question_id": "sim3_UA_UAV1_1001",
  "question_type": "3.2 Usability Assessment (UAV1)",
  "question": "Is the image captured by UAV1 usable for detecting drones, vehicles, pedestrians, and bicycles?",
  "options": {
    "A": "Yes, highly usable",
    "B": "Yes, usable",
    "C": "Yes, partially usable",
    "D": "No, not usable"
  },
  "correct_answer": "A",
  "source": "Rule-Based from JSON"
}

Perception Assessment

Causal Assessment

TASK EXPLANATION: This type of question requires the student to analyze causes of perception quality issues in multi-UAV views, with focus on drone, vehicle, pedestrian, and bicycle detection. The proper answer should identify key factors affecting perception effectiveness and understand the impact of causal relationships on perception quality.

TEMPLATE Question: "What is the primary cause of [perception issue] in [UAV ID]'s image?"

TEMPLATE Choices: "[Specific cause] [affecting factor]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' ability to analyze causes of perception quality issues in multi-UAV views, with focus on drone, vehicle, pedestrian, and bicycle detection."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Follow a structured thinking process: analyze image → identify potential causes → formulate question → create options → verify correctness
3. Questions must be based on actual visual content or provided description
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only
7. Focus on causes that affect detection of drones, vehicles, pedestrians, and bicycles

THINKING PROCESS:

1. First, describe the potential causes of perception issues (occlusion, lighting, resolution, etc.) in the image or description
2. Identify the primary cause based on generation index
3. Formulate a clear, specific question about the cause of perception issues for target detection
4. Create 4 distinct options where only one is correct
5. Verify the question is unambiguous and answerable

EXAMPLE OUTPUT:

{
  "question_id": "sim3_CA_UAV1_1001",
  "question_type": "3.3 Causal Assessment (UAV1)",
  "question": "What is the primary cause of reduced visibility for drones and vehicles in this scene?",
  "options": {
    "A": "Heavy fog obscuring distant objects",
    "B": "Bright sunlight causing glare on metallic surfaces",
    "C": "Low resolution making small objects indistinct",
    "D": "Camera angle limiting field of view"
  },
  "correct_answer": "A",
  "image_description": "The image shows heavy fog reducing visibility for drones, vehicles, pedestrians, and bicycles."
}

Perception Assessment

Improvement Assessment

TASK EXPLANATION: This type of question requires the student to suggest improvements for perception quality in multi-UAV views, with focus on drone, vehicle, pedestrian, and bicycle detection. The proper answer should propose practical methods to enhance perception effectiveness and understand how to mitigate identified issues.

TEMPLATE Question: "How can the perception quality for detecting [target types] be improved in [UAV ID]'s image?"

TEMPLATE Choices: "By [specific improvement method] to [benefit]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' ability to suggest improvements for perception quality in multi-UAV views, with focus on drone, vehicle, pedestrian, and bicycle detection."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Follow a structured thinking process: analyze issues → identify improvement methods → formulate question → create options → verify correctness
3. Questions must be based on actual visual content or provided description
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only
7. Focus on improvement methods that enhance detection of drones, vehicles, pedestrians, and bicycles

THINKING PROCESS:

1. First, identify the perception issues (poor lighting, occlusion, low resolution, etc.) in the image or description
2. Suggest practical improvement methods
3. Formulate a clear, specific question about improving perception for target detection
4. Create 4 distinct options where only one is correct
5. Verify the question is unambiguous and answerable

EXAMPLE OUTPUT:

{
  "question_id": "sim3_IA_UAV1_1001",
  "question_type": "3.4 Improvement Assessment (UAV1)",
  "question": "How can the perception quality for detecting drones and vehicles be improved in this scene?",
  "options": {
    "A": "By adjusting the UAV altitude to reduce occlusion",
    "B": "By changing the color filter to enhance contrast",
    "C": "By increasing the frame rate for better motion capture",
    "D": "By adding more UAVs for multi-angle views"
  },
  "correct_answer": "A",
  "image_description": "The image shows occlusion issues that can be improved by adjusting UAV altitude."
}

Scene Understanding

Scene Description

TASK EXPLANATION: This type of question requires the student to describe overall scenes from multi-UAV perspectives, integrating information from multiple views. The proper answer should provide comprehensive scene descriptions that capture key elements and dynamics.

TEMPLATE Question: "What is the overall scene description integrating views from all UAVs?"

TEMPLATE Choices: "[Comprehensive scene description]"

CORE PROMPT STRUCTURE: system prompt = "You are an expert teacher of the 'UAV Multi-view Perception' course. Your role is to create high-quality multiple-choice questions that test students' ability to describe overall scenes from multi-UAV perspectives, integrating information from multiple views."

CRITICAL RULES:

1. ALWAYS respond in English only
2. Focus on integrating multi-UAV views for comprehensive scene understanding
3. Questions must emphasize key scene elements: environment, targets, dynamics
4. Each question should have exactly 4 options (A, B, C, D) with only one correct answer
5. Options should be plausible, distinct in meaning, and avoid minor rephrasing
6. Output must be valid JSON format only

THINKING PROCESS:

1. First, integrate descriptions from all UAV views
2. Formulate a clear question about overall scene description
3. Create 4 distinct options where only one is correct
4. Verify the question captures multi-view integration

EXAMPLE OUTPUT:

{
  "question_id": "sim3_SD_1001",
  "question_type": "4. Scene Description",
  "question": "What is the overall scene description integrating views from all UAVs?",
  "options": {
    "A": "Busy urban intersection with multiple vehicles, pedestrians, and a hovering drone",
    "B": "Quiet rural road with few vehicles and no pedestrians",
    "C": "Industrial area with heavy machinery and workers",
    "D": "Park setting with people walking and bicycles"
  },
  "correct_answer": "A",
  "image_description": "Integrated view shows a busy urban intersection with vehicles, pedestrians, and a drone."
}
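
Finally, every generated item must satisfy the structural constraints stated in the rules above: valid JSON, exactly four labeled options, and a single marked correct answer. A minimal validator sketch, using the field names from the example outputs, might look like this:

import json

# Hypothetical helper; field names follow the example outputs above.
REQUIRED_KEYS = {"question_id", "question_type", "question",
                 "options", "correct_answer"}

def validate_item(raw: str) -> dict:
    item = json.loads(raw)  # raises ValueError on invalid JSON
    missing = REQUIRED_KEYS - item.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    options = item["options"]
    # Exactly four options labeled A-D.
    if set(options) != {"A", "B", "C", "D"}:
        raise ValueError("options must be exactly A, B, C, D")
    # The marked correct answer must reference one of the options.
    if item["correct_answer"] not in options:
        raise ValueError("correct_answer must be one of the option letters")
    return item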

📊 Results on AirCopBench

Results on AirCopBench

Table 3: Results on AirCopBench for various existing MLLMs across 14 task types and 4 evaluation dimensions. The best-performing model in each category is highlighted in bold, while the second-best is underlined.
