CARLA Evaluation Dataset for Privacy Preserving Visual SLAM

Mikiya Shibuya*   Shinya Sumikura*   Ken Sakurada*

* The authors assert equal contribution and joint first authorship.

Summary

This dataset contains synthetic image sequences captured in various types of road scenes using the CARLA Simulator. Each sequence consists of two videos that partially overlap, mainly for benchmarking global optimization in Visual SLAM algorithms. In addition, most of the sequences are captured with three types of projection models: perspective, fisheye, and equirectangular. This dataset was originally used for the evaluation in our paper "Privacy Preserving Visual SLAM", which appeared at ECCV 2020. We make this dataset publicly available for researchers who are interested in Visual SLAM. Although we own its copyright, you can freely use it for research purposes, e.g. benchmarking of Visual SLAM algorithms. We request that you cite the following paper if you publish any research results utilizing this dataset.

Mikiya Shibuya, Shinya Sumikura, Ken Sakurada,
Privacy Preserving Visual SLAM,
In Proceedings of the European Conference on Computer Vision (ECCV), 2020.

Download

Click here to access the Google Drive folder that contains the full dataset (about 6.86 GB).

Directory Structure


|
|-- 01/                     # directory for Seq. #01 data
|   |
|   |-- prebuilt/           # seq. for creating prebuilt map
|   |   |-- video_pers.mp4  # video captured with perspective model
|   |   └-- poses.txt       # ground-truth poses
|   |
|   └-- input/              # seq. for LC-VSLAM input
|       |-- video_pers.mp4  # video captured with perspective model
|       └-- poses.txt       # ground-truth poses
|
└-- 02/ (or 03/ - 12/)      # directory for Seq. #02 (or #03 - #12) data
    |
    |-- prebuilt/           # seq. for creating prebuilt map
    |   |-- video_pers.mp4  # video captured with perspective model
    |   |-- video_fish.mp4  # video captured with fisheye model
    |   |-- video_equi.mp4  # video captured with equirectangular model
    |   └-- poses.txt       # ground-truth poses
    |
    └-- input/              # seq. for LC-VSLAM input
        |-- video_pers.mp4  # video captured with perspective model
        |-- video_fish.mp4  # video captured with fisheye model
        |-- video_equi.mp4  # video captured with equirectangular model
        └-- poses.txt       # ground-truth poses

Description

We made this dataset with the CARLA Simulator. Using the urban models (Town02 - Town05) provided by CARLA, we manually steered a car with a virtual camera mounted on it and recorded an image sequence and a ground-truth trajectory. We captured the sequences using three types of projection models: perspective, fisheye, and equirectangular (except for Seq. #01, which is perspective only). In addition, each sequence has two videos, prebuilt and input, that partially overlap so that together they form a looping trajectory. The prebuilt videos are used for prebuilt-map creation in our paper, and the input videos serve as input to LC-VSLAM.

In the following, details on the intrinsic parameters, video format, and ground-truth trajectory format are presented. Subsequently, information on each of the sequences is listed.

Intrinsic Parameters

For each of the projection models, we provide projection parameters, which are needed to reproject a camera-referenced 3D point onto an image plane.

Perspective

There are four pinhole-projection parameters for the perspective model. Note that there are neither radial nor tangential distortion parameters.

Seq.        fx    fy    cx    cy    cols   rows
#01         320   320   320   180   640    360
#02 - #12   640   640   640   360   1280   720

  • fx, fy: focal lengths (in pixel unit) for x- and y-axes
  • cx, cy: principal points (in pixel unit) for x- and y-axes
  • cols, rows: horizontal and vertical image size (in pixel unit)
The projection process can be described by the following pseudo-code:

a = Xc / Zc
b = Yc / Zc
x = fx * a + cx
y = fy * b + cy

where [Xc, Yc, Zc] are the camera-referenced 3D coordinates of the point.
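
For reference, the same projection written as a short Python/NumPy sketch (the function name and the defaults, taken from the Seq. #02 - #12 parameters, are ours and not part of the dataset):

import numpy as np

def project_perspective(point_c, fx=640.0, fy=640.0, cx=640.0, cy=360.0):
    # point_c: camera-referenced 3D point [Xc, Yc, Zc] with Zc > 0
    Xc, Yc, Zc = point_c
    a = Xc / Zc
    b = Yc / Zc
    return np.array([fx * a + cx, fy * b + cy])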

Fisheye

There are four fisheye-projection parameters for the fisheye model. Note that there are no distortion parameters.

Seq.        fx    fy    cx    cy    cols   rows
#02 - #12   586   586   640   360   1280   720

  • fx, fy: focal lengths (in pixel unit) for x- and y-axes
  • cx, cy: principal points (in pixel unit) for x- and y-axes
  • cols, rows: horizontal and vertical image size (in pixel unit)
The projection process can be described by the following pseudo-code:

a = Xc / Zc
b = Yc / Zc
r = sqrt(a * a + b * b)
theta = atan(r)
s = theta / r    # take s = 1 when r = 0 (the limit of theta / r as r -> 0)
x = fx * s * a + cx
y = fy * s * b + cy

where [Xc, Yc, Zc] are the camera-referenced 3D coordinates of the point.
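
A corresponding Python sketch, with the r = 0 case handled explicitly (again, the function name and defaults are ours, not part of the dataset):

import numpy as np

def project_fisheye(point_c, fx=586.0, fy=586.0, cx=640.0, cy=360.0):
    # point_c: camera-referenced 3D point [Xc, Yc, Zc] with Zc > 0
    Xc, Yc, Zc = point_c
    a = Xc / Zc
    b = Yc / Zc
    r = np.sqrt(a * a + b * b)
    s = np.arctan(r) / r if r > 0.0 else 1.0  # theta / r -> 1 as r -> 0
    return np.array([fx * s * a + cx, fy * s * b + cy])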

Equirectangular

There are no intrinsic parameters for the equirectangular model; only the image size is needed for the projection process.

Seq.        cols   rows
#02 - #12   2160   1080

  • cols, rows: horizontal and vertical image size (in pixel unit)
The projection process can be described by the following pseudo-code:

l = sqrt(Xc * Xc + Yc * Yc + Zc * Zc)
bx = Xc / l
by = Yc / l
bz = Zc / l
latitude = -asin(by)
longitude = atan2(bx, bz)
x = cols * (0.5 + longitude / (2 * PI))
y = rows * (0.5 - latitude / PI)

where [Xc, Yc, Zc] are the camera-referenced 3D coordinates of the point and [bx, by, bz] is the corresponding unit bearing vector.
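
The equirectangular projection as a Python sketch (the function name and the 2160x1080 defaults are ours):

import numpy as np

def project_equirectangular(point_c, cols=2160, rows=1080):
    # point_c: camera-referenced 3D point [Xc, Yc, Zc]
    bx, by, bz = np.asarray(point_c, dtype=float) / np.linalg.norm(point_c)
    latitude = -np.arcsin(by)
    longitude = np.arctan2(bx, bz)
    x = cols * (0.5 + longitude / (2.0 * np.pi))
    y = rows * (0.5 - latitude / np.pi)
    return np.array([x, y])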

Video

All of the videos are encoded in H.264 (YUV420p) at a framerate of 30.0 FPS.
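
The frames can be extracted with standard tools; for instance, a minimal OpenCV sketch (the path follows the directory structure above):

import cv2

cap = cv2.VideoCapture("01/prebuilt/video_pers.mp4")
while True:
    ok, frame = cap.read()  # decodes the next BGR frame
    if not ok:              # end of stream
        break
    # ... process frame here ...
cap.release()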

Ground-truth Poses

All of the trajectory files are formatted in the same manner as those of the KITTI dataset. Each line of a trajectory file represents a camera-to-world SE(3) pose stored in row-major order, and the line number corresponds to the frame number of the video.

For example, assume that the following values are written on the N-th line of ./01/prebuilt/poses.txt. These values represent the ground-truth camera pose of the N-th frame of ./01/prebuilt/video_pers.mp4.


r_11, r_12, r_13, t_1, r_21, r_22, r_23, t_2, r_31, r_32, r_33, t_3

These values can be interpreted as a camera-to-world SE(3) pose T_wc, which converts a 3D point in the N-th camera's coordinates to the corresponding point in world coordinates.


T_wc = [
    [r_11,  r_12,  r_13,  t_1],
    [r_21,  r_22,  r_23,  t_2],
    [r_31,  r_32,  r_33,  t_3],
    [   0,     0,     0,    1]
]

The following is an example of coordinate transformation, where X_c and X_w represent the camera- and world-referenced homogeneous coordinates of a 3D point, respectively.


X_w = T_wc @ X_c
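
As a concrete example, the following Python sketch loads a trajectory file and applies T_wc to a point (the function name is ours; it accepts both comma- and whitespace-separated values):

import numpy as np

def load_poses(path):
    # One pose per line: 12 values forming the 3x4 [R | t] block, row-major
    poses = []
    with open(path) as f:
        for line in f:
            vals = [float(v) for v in line.replace(",", " ").split()]
            if not vals:
                continue  # skip blank lines
            T_wc = np.eye(4)
            T_wc[:3, :] = np.reshape(vals, (3, 4))
            poses.append(T_wc)
    return poses

poses = load_poses("01/prebuilt/poses.txt")
X_c = np.array([1.0, 2.0, 3.0, 1.0])  # homogeneous camera-referenced point
X_w = poses[0] @ X_c                  # the same point in world coordinates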

Sequence List

Sequences #02 - #12 are provided in all three projection models (perspective: 1280x720, fisheye: 1280x720, equirectangular: 2160x1080), while Seq. #01 is provided in the perspective model only (640x360).

Seq.   Length (prebuilt)   Length (input)   Direct Link      CARLA world
#01    1:07                1:04             [Google Drive]   Town05
#02    2:20                1:16             [Google Drive]   Town02
#03    2:08                1:28             [Google Drive]   Town02
#04    1:40                1:49             [Google Drive]   Town04
#05    2:11                1:34             [Google Drive]   Town04
#06    2:47                1:34             [Google Drive]   Town05
#07    2:20                2:00             [Google Drive]   Town05
#08    2:21                2:04             [Google Drive]   Town03
#09    2:26                1:58             [Google Drive]   Town03
#10    2:45                3:46             [Google Drive]   Town03
#11    2:56                2:53             [Google Drive]   Town03
#12    2:23                2:17             [Google Drive]   Town03

Citation

@inproceedings{shibuya2020privacy,
  title = {Privacy Preserving Visual {SLAM}},
  author = {Mikiya Shibuya and Shinya Sumikura and Ken Sakurada},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2020}
}

Contact

  • Mikiya Shibuya: shibuya.m.ab <at> m.titech.ac.jp
  • Ken Sakurada: k.sakurada <at> aist.go.jp