ReCo Dataset: the dataset files can be found at https://www.kaggle.com/datasets/fdudsde/reco-dataset.
Layout planning is centrally important in architecture and urban design. Among the basic units that carry urban functions, the residential community plays a vital part in supporting human life. The layout planning of residential communities has therefore long been of concern, and has attracted particular attention since the advent of deep learning, which facilitates automated layout generation and spatial pattern recognition. However, the research community generally suffers from a shortage of residential community layout benchmarks and high-quality datasets, which hampers the exploration of data-driven methods for residential community layout planning. The lack of datasets is largely due to the difficulty of acquiring large-scale real-world residential data and the long-term expert screening it requires. To address these issues and advance a benchmark dataset for intelligent spatial design and analysis applications in smart city development, we introduce the Residential Community Layout Planning (ReCo) Dataset, which is the first and largest open-source vector dataset of real-world communities to date. The ReCo Dataset is presented in multiple data formats with 37,646 residential community layout plans, covering 598,728 residential buildings with height information. ReCo can be conveniently adapted for urban design tasks related to residential community layout, e.g., generative layout design, morphological pattern recognition, and spatial evaluation. To validate the utility of ReCo in automated residential community layout planning, a Generative Adversarial Network (GAN) based generative model is further applied to the dataset. We expect the ReCo Dataset to inspire more creative and practical work in intelligent design and beyond. The ReCo Dataset is published at: https://www.kaggle.com/fdudsde/reco-dataset.
Layout tasks in architecture and urban planning refer to the physical arrangement of urban spatial components at different scales in a creative and functionally reasonable way, where building floor layout planning, community layout planning, and urban planning are the typical tasks from fine- to coarse-grained (shown in Fig. 1), contributing significantly to the quality and sustainability of buildings, neighborhoods, or entire cities. In essence, layout planning is an analytical, problem-solving activity that has to meet various requirements and specifications, and traditional planning methods led by expert experience are becoming sluggish in the increasingly complex contemporary design context. Recent years have witnessed rapidly growing research interest in intelligent design, e.g., design pattern recognition [2, 3], building volume generation, and street network generation, using advanced data-driven approaches, which greatly promote quantitative digital layout planning in a more automated, rational, and efficient way.
Among the three typical layout tasks, research on building floor layout planning, represented by indoor furniture placement and floor plan generation [7, 8, 9, 10], is the most active, thanks in large part to relatively mature, high-quality datasets. In contrast, studies on data-driven urban planning and residential community layout planning are still limited [11, 12]. As a key task that directly affects the quality of residential life and urban environmental space, residential community layout planning links floor layout and urban planning. However, effective work cannot be carried out widely due to the lack of a large-scale, reliable, open-source benchmark dataset. Specifically, current work on automatic generative residential layout and design mainly relies on rule-based approaches [12, 14, 4, 15]. Although some efforts have been made to apply Generative Adversarial Networks (GAN) to generative design, the datasets involved cannot be accessed publicly. Besides, the relevant analytic tasks, e.g., community layout pattern recognition [2, 3, 18, 19, 20], also suffer from limited performance and inadequate datasets. Although some algorithms, e.g., online reinforcement learning, show promising results in relying less on data or even requiring no historical data during training, sufficient data is still essential for evaluating model effectiveness [22, 23].
To resolve the data inadequacy issues and pave the way to robust and open data-driven modelling for residential community layout tasks, in this paper we introduce the Residential Community Layout Planning Dataset (ReCo Dataset), which is to date the first and largest vector dataset based on real-world residential communities. The ReCo Dataset contains 37,646 residential community layout plans sampled from 60 different cities, covering 598,728 residential buildings. Building height information is also included to extend the two-dimensional (2D) information to 3D space, making ReCo applicable to more 3D modeling and spatial evaluation tasks. Unlike raster image-based datasets in architectural design, e.g., the LIFULL HOME’s dataset and the RPLAN dataset, ReCo provides more fine-grained coordinate information, from which commonly used raster or vector (polygon) data formats, such as images and Shapefiles, as well as 3D geometry (with height information), can be exported flexibly. Providing data in this form preserves the spatial attributes of buildings, including distance and size, so that the dataset can adapt to generation and analysis tasks at different levels of precision.
In addition to the construction of the ReCo Dataset, which involves data collection and processing, this paper also demonstrates the usability of the dataset and provides a benchmark on one of the principal downstream data-driven tasks, i.e., automated residential community layout planning, using the Deep Convolutional Generative Adversarial Network (DCGAN) as the backbone. The experimental results confirm the potential of applying our dataset to tasks in architecture and urban design.
The contributions of our paper can be summarized as follows:
We release ReCo, the first large-scale open-source residential community layout planning dataset. ReCo can be applied to numerous promising applications such as generative layout planning, pattern recognition, classification and spatial evaluation.
ReCo is a diverse and extensive dataset containing layout information of 37,646 residential communities and 598,728 buildings across 60 cities. These sample cities span a large geographical area covering inland cities, coastal cities, etc., with different urban characteristics.
ReCo is a fine-grained, coordinate-based dataset that can be flexibly exported to various common spatial file formats. It provides more spatial attribute information than image-based datasets, so it can be applied to a wider range of tasks at different scales.
We build a generative model to validate dataset usability, which can serve as a baseline for benchmarking the task of automated residential community layout planning. We believe ReCo has great potential to expedite research in the growing field of intelligent layout planning.
In this section, we first review methods for the three typical layout tasks, namely community layout planning, building floor layout planning, and urban planning. We then summarize the datasets applied to these tasks. In general, the maturity and scale of existing datasets vary widely across layout types, resulting in uneven development of technical methods in the field. Collecting high-quality data for community layout research is particularly time-consuming and challenging, so only a few small-scale datasets exist at the community level, and they are not publicly available.
The study of building floor layout planning has been at the forefront of the three, with numerous efforts using advanced algorithms to automate the task, which have largely replaced the traditional experience-oriented design method. Earlier research is mainly based on computer-aided methods that exploit explicit rules, i.e., translating prior domain knowledge into computer algorithms [28, 29, 30, 6]. However, methods based on finite rules inevitably struggle with the complex relational modeling in layout planning, and the development of related datasets and models provides solutions to this challenge [31, 32]. As for community layout planning, existing research mainly focuses on community layout pattern recognition and classification [20, 2]. In contrast with building floor layout planning, there are few studies on generative community design. The development of this field is hindered by small-scale, closed-source, or proprietary datasets [4, 17, 33]. For the coarse-grained tasks of urban planning, most current research focuses on design optimization, generative design, and urban environmental evaluation [14, 5, 34, 12], while [35] and GAN are under-utilized in design optimization. Only a few methods have been developed to generate road networks, but they exhibit potential to enhance generative urban design. As for urban environmental evaluation, when numerous objective evaluation metrics are required, only a concept of interactive machine learning integrating clustering, feature extraction, and human subjective-oriented Reinforcement Learning has been proposed, where adequate data is particularly important for establishing the objective reward function and evaluation system.
Two commonly used open-source datasets for building floor layout planning tasks are the RPLAN dataset and the LIFULL HOME dataset, which offer 80K annotated house layouts and 5 million ground-truth floorplans, respectively. They have been applied to automatic floorplan generative models [36, 9, 10, 37]. With large-scale datasets providing sufficient training data for GANs, existing models can automatically produce floor plans that are indistinguishable from the ground truth [9, 10]. Community layout planning models, by contrast, struggle with limited amounts of data. For example, Dong et al. proposed a Convolutional Auto-Encoder model to embed plots using a dataset with 1,887 samples. The study suggests that larger-scale data covering more attributes can help improve model performance and utility. Bei et al. introduced a Graph Convolutional Network (GCN) to accomplish the tasks of building state identification, building node clustering, and building pattern recognition. Nevertheless, the individual blocks containing building contours in their dataset are randomly selected rectangular areas rather than strict community boundaries. In addition, Yan et al. presented a GCN to classify building patterns, which also remains limited by the small dataset (2,194 samples). For intelligent community layout planning, Cheng et al. applied a convolutional GAN to generate residential layout plans, while the training process is still limited by the small sample size of 1,050. The diversity of data is also crucial for pattern recognition and layout generation tasks, since it provides richer information and a wider data distribution for generation [39, 40]. Furthermore, due to the lack of benchmark datasets, it is difficult to evaluate and compare the performance of different models. In summary, data-driven community layout planning tasks rely heavily on datasets. In terms of urban planning, Hartmann et al. applied data extracted from OpenStreetMap (OSM, https://www.openstreetmap.org/) to generate road networks. Although OSM contains a large amount of raw geographic data, it cannot be directly applied to model training without complex preprocessing. We summarize and compare the above datasets and the ReCo dataset in Tab. 2.
Generally, the smaller-scale task of building floor layout planning has been more widely studied, especially for generation, benefiting from numerous relatively mature datasets, while the development of larger-scale urban planning and community layout planning is held back by the lack of high-quality benchmark datasets. Regarding community layout planning, there is still considerable room for improvement in the performance of data-driven models, which highlights the urgent need for large-scale open-source benchmark datasets in this research area. We hope to address these issues through the release of the ReCo Dataset, thereby accelerating the maturity of data-driven methods for community layout planning, and even for smaller- or larger-scale layout tasks in the field of architectural design.
In this section, we describe the unique characteristics of ReCo, as well as the pipeline to acquire the dataset from the collection of raw data in real-world residential communities, to the calibration and standardization of community boundaries and building outlines.
The ReCo dataset is based on high-precision vector coordinates rather than raster images, and can be easily converted to various data types, such as 2D images, 3D geometry, and Shapefiles. The properties of ReCo can be summarized in the following three major points:
ReCo covers residential community layout data in 60 cities, with different scales of residential areas, residential distribution characteristics, and residential forms, which constitute the diversity of the dataset. See Section 3.4 for specific statistics. ReCo allows researchers to partition the dataset for different research purposes based on features including, but not limited to, location, number of buildings, and plot size.
To our knowledge, ReCo is so far the first and largest open-source vector dataset of real-world communities. Because the dataset is based on coordinates, it can flexibly output various common spatial data formats and retains the original information of the data to the greatest extent.
ReCo can serve as a standard dataset for the residential community layout planning related tasks, providing a benchmark for evaluating the performance of various data-driven models, to facilitate the convergence and progress of advanced algorithms and urban planning.
ReCo is provided in the JSON and GeoJSON data-interchange formats, which describe coordinates and spatial attributes. These vector data formats support conversion to commonly used spatial formats. To make it easier for users to apply ReCo to image-based layout tasks, we provide a way to generate 2D images from the existing dataset formats, with the code published in our GitHub repository: https://github.com/FDUDSDE/ReCo-Dataset.
To explain the content of the ReCo Dataset in detail, we randomly select an example community (shown in Fig. 2 (a)) and convert the corresponding data in the ReCo Dataset into a 2D image (shown in Fig. 2 (b)). ReCo consists of three types of instances, namely building, residential community, and city, while the basic elements for describing the instances are coordinates. We summarize the basic element and instance types as follows:
Coordinate: geographical coordinates have been converted to Mercator coordinates  and desensitized for legal and privacy concerns.
Building: a residential building in the community, described by a set of coordinates tracing its outline. Each building has a unique identifier (“building_id”) within its city. The building height attribute is represented by “building floor”, with the assumption that each floor is 3 meters high.
Community: the community in which buildings are located, identified by “_id” (a unique identifier automatically generated by MongoDB, https://www.mongodb.com) together with a city attribute giving its location. A set of coordinates (the community boundary coordinates) describes the community’s outline, and coordinate values are constrained to be non-negative.
City: the city where communities are located. In ReCo, “city” is also treated as an attribute of communities. The set of community instances sharing the same “city” attribute forms a subset of the dataset.
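As a rough illustration of the instance types above, the following Python sketch builds one hypothetical community record and converts it to a GeoJSON FeatureCollection. Only “_id”, “city”, “building_id”, and the floor-count attribute are named in the text; the key spellings used here (“floor”, “boundary”, “buildings”, “coords”), the nesting, and all values are assumptions made for this example and may differ from the released files. Building height follows the stated assumption of 3 meters per floor.

```python
import json

FLOOR_HEIGHT_M = 3  # assumption stated in the text: each floor is 3 meters high

# Hypothetical record. Key names other than "_id", "city" and "building_id"
# are illustrative guesses, not the released schema.
community = {
    "_id": "example-community",
    "city": "example_city",
    "boundary": [[0, 0], [100, 0], [100, 80], [0, 80]],
    "buildings": [
        {"building_id": 1, "floor": 6,
         "coords": [[10, 10], [40, 10], [40, 25], [10, 25]]},
        {"building_id": 2, "floor": 18,
         "coords": [[55, 40], [85, 40], [85, 60], [55, 60]]},
    ],
}


def close_ring(coords):
    """GeoJSON polygon rings must end on the point they start with."""
    ring = [list(p) for p in coords]
    if ring[0] != ring[-1]:
        ring.append(list(ring[0]))
    return ring


def polygon_feature(coords, properties):
    """Wrap one outline as a GeoJSON Feature with a Polygon geometry."""
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [close_ring(coords)]},
        "properties": properties,
    }


def to_feature_collection(record):
    """Convert one community record into a GeoJSON FeatureCollection."""
    features = [polygon_feature(
        record["boundary"],
        {"kind": "community", "_id": record["_id"], "city": record["city"]},
    )]
    for b in record["buildings"]:
        features.append(polygon_feature(
            b["coords"],
            {"kind": "building",
             "building_id": b["building_id"],
             "height_m": b["floor"] * FLOOR_HEIGHT_M},
        ))
    return {"type": "FeatureCollection", "features": features}


collection = to_feature_collection(community)
geojson_text = json.dumps(collection, indent=2)
```

The resulting text can be written to a `.geojson` file and opened in common GIS tools; the `height_m` property is what a viewer would use to extrude the 2D outlines into 3D volumes.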
In order to capture the morphology of residential community layout plans on a large scale, we prepare two parts of raw data. The first is the building information in the map, including the buildings’ vector outlines and height information (the Building Morphology Data), collected from OpenStreetMap (https://www.openstreetmap.org/) and Google Earth Engine (https://developers.google.com/earth-engine). The second is the residential community information, including boundary coordinates (the Community Morphology Data), acquired through the Baidu Map APIs (https://lbsyun.baidu.com).
The dataset generation pipeline is visualized in Fig. 3. After the two parts of raw data are collected, the corresponding coordinate information is extracted. Since these multi-source data use different geographic coordinate systems, unification is required to align the two parts of coordinate data: data from different geographic coordinate systems are projected onto the same 2D plane, i.e., the Transverse Mercator projection. Due to the sensitivity of geographic information, coordinate correction and desensitization are applied after the unification process. Next, within the unified spatial environment, we determine whether a building belongs to a community by checking whether the building centroid lies within the area enclosed by the community boundary. Finally, the two parts of data are combined into one as the pipeline output, which completely describes the residential community layout plans. The building height information is also kept. We save the data in JSON and GeoJSON formats so that users can export images or geographic maps.
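The building-to-community assignment step can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: it assumes simple (non-self-intersecting) polygons already projected onto a common plane, computes the centroid with the shoelace formula, and tests containment with even-odd ray casting. The function names and the toy coordinates are hypothetical.

```python
def polygon_centroid(coords):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = cx = cy = 0.0
    n = len(coords)
    for i in range(n):
        x0, y0 = coords[i]
        x1, y1 = coords[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross                 # accumulates twice the signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    return cx / (3 * area2), cy / (3 * area2)


def point_in_polygon(point, polygon):
    """Even-odd ray casting: count edge crossings of a ray heading in +x."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if (y0 > y) != (y1 > y):       # edge straddles the horizontal line at y
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def assign_buildings(boundary, buildings):
    """Return ids of buildings whose centroid lies inside the boundary."""
    return [bid for bid, outline in buildings.items()
            if point_in_polygon(polygon_centroid(outline), boundary)]


# Toy data (hypothetical): one community boundary and three building outlines.
boundary = [(0, 0), (100, 0), (100, 80), (0, 80)]
buildings = {
    "a": [(10, 10), (40, 10), (40, 25), (10, 25)],    # fully inside
    "b": [(90, 70), (130, 70), (130, 90), (90, 90)],  # centroid outside
    "c": [(70, 30), (110, 30), (110, 50), (70, 50)],  # overlaps, centroid inside
}
members = assign_buildings(boundary, buildings)
```

Using the centroid rather than full containment means a building that slightly overlaps the boundary, like "c" above, is still assigned to the community, while one that mostly lies outside, like "b", is not.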
Table 1: Statistics of ReCo.
| Stats | # of Communities | # of Buildings | Average |
To describe ReCo more specifically, we provide its statistics in Tab. 1. Notably, all data in ReCo comes from the real world and meets the realistic requirements for the layout of residential buildings. Compared with the datasets for the three typical tasks mentioned in Section 2, the ReCo dataset is by far the largest and the only open-source dataset in the community layout planning field (shown in Tab. 2). Moreover, the data volume of ReCo is more than 17 times that of the previous largest residential layout dataset. In addition, ReCo has the widest data distribution, with samples from 60 different cities, which increases the diversity of the dataset. However, compared with the two datasets [25, 24] commonly used in building floor layout planning, ReCo still has considerable room for improvement in data volume.
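The headline figures can be checked with a few lines of arithmetic. The community and building counts come from the text; taking the 2,194-sample dataset mentioned in Section 2 as the previous largest is an assumption made here, since the citation for that comparison is not recoverable from the text.

```python
communities = 37_646          # residential community layout plans in ReCo
buildings = 598_728           # residential buildings in ReCo
previous_largest = 2_194      # assumption: the 2,194-sample dataset from Section 2

# Mean number of buildings per community.
avg_buildings = buildings / communities

# How many times larger ReCo is than the assumed previous largest dataset.
size_ratio = communities / previous_largest

print(f"{avg_buildings:.1f} buildings per community on average")
print(f"{size_ratio:.1f} times the assumed previous largest dataset")
```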
GAN models have achieved a breakthrough in the field of building floor layout planning [7, 9, 10]. We expect that GANs can also be applied to residential community layout generation tasks if supported by sufficient data. Based on the ReCo Dataset, we therefore propose a generative model for residential community layout planning and conduct a baseline experiment. We aim to answer the following research questions:
RQ1: Can our dataset be applied to residential community layout planning generative tasks, and how does it perform?
RQ2: How does the size of training dataset affect the performance of residential community layout planning generative model?
RQ3: What is the effect of different data distribution (i.e., sampled from different regions) on training of the model?
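Table 3: ReCo and its three subsets used in the experiments.
| Dataset | Description | Size |
| city_60 | the data from the city in the ReCo | 3,974 |
| h_city_60 | randomly sampled from the city_60 | 2,000 |
| city_40 | the data from the city in the ReCo | 2,095 |
| ReCo | the whole ReCo Dataset | 37,646 |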
We trained a DCGAN-based generative model for residential community layout planning on ReCo to demonstrate the applicability of our dataset (our experimental code is adapted from the GitHub repository https://github.com/eriklindernoren/PyTorch-GAN). The model architecture is illustrated in Fig. 4. ReCo and its three subsets used in the experiments are summarized in Tab. 3. The model was trained for 2K epochs with a batch size of 128 per sub-experiment. Detailed training information can be found in our repository (https://github.com/FDUDSDE/ReCo-Dataset).
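Subsets like those in Tab. 3 can be derived by grouping communities on their “city” attribute and drawing a fixed-size random sample. The sketch below is only an illustration on toy records: the record layout, the city labels, the sample size, and the seed are assumptions, and the authors' actual sampling procedure is not described in the text.

```python
import random
from collections import defaultdict


def split_by_city(records):
    """Group community records by their "city" attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[record["city"]].append(record)
    return dict(groups)


def subsample(records, k, seed=0):
    """Draw k records without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, k)


# Toy stand-in for the dataset: 30 records spread over two hypothetical cities.
records = [{"_id": i, "city": "city_a" if i % 3 else "city_b"}
           for i in range(30)]

by_city = split_by_city(records)
half_city_a = subsample(by_city["city_a"], k=10)
```

Comparing a model trained on `by_city["city_a"]` with one trained on `half_city_a` isolates the effect of training-set size under the same data distribution, which is the comparison RQ2 asks for.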
To evaluate the performance of a GAN model, it is common practice to use the Fréchet Inception Distance (FID) score [43, 44], which measures the distance between the distributions of real and generated data (a smaller FID score means the generated data is closer to the real data). This evaluation method has also been used in HouseGAN++, one of the typical models in the field of building floor layout planning. We divide the four experiments into two subgroups for comparison; the results measured by FID scores are shown in Fig. 6 (a) and (b), respectively.
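To give an intuition for why a smaller score means closer distributions, the sketch below computes the Fréchet distance between two univariate Gaussians fitted to sample values. This is a deliberately simplified scalar case: the actual FID fits multivariate Gaussians to Inception network features of real and generated images, which this example does not do. The function name and the toy numbers are hypothetical.

```python
from math import sqrt
from statistics import fmean, pvariance


def frechet_distance_1d(real, generated):
    """Fréchet distance between 1-D Gaussians fitted to the two samples.

    d^2 = (mu_r - mu_g)^2 + var_r + var_g - 2 * sqrt(var_r * var_g)
    """
    mu_r, mu_g = fmean(real), fmean(generated)
    var_r, var_g = pvariance(real), pvariance(generated)
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2 * sqrt(var_r * var_g)


real = [0.0, 1.0, 2.0, 3.0, 4.0]
same = list(real)                          # identical distribution
shifted = [v + 2.0 for v in real]          # same spread, different mean
stretched = [v * 2.0 for v in real]        # different mean and spread

d_same = frechet_distance_1d(real, same)
d_shifted = frechet_distance_1d(real, shifted)
d_stretched = frechet_distance_1d(real, stretched)
```

The distance is zero when the two fitted Gaussians coincide and grows with any mismatch in mean or variance, which matches how the FID scores are read in the comparisons reported here.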
To demonstrate the generation performance of the model on ReCo, we sample images generated by the model trained on the complete ReCo (shown in Fig. 5). As shown in Fig. 5 (a), (b) and (d), part of the generated data has the morphological characteristics of the real data. However, as shown in Fig. 5 (c), there are still communities with uneven building spacing in the generated results. In addition, Fig. 5 (e) and (f) show that the existing model performs poorly when generating communities with irregular boundaries. Optimizing for this situation remains a challenge, and ReCo provides the data foundation for it.
As shown in Fig. 6 (a), the FID score decreases, i.e., model performance improves, as the dataset size increases. Fig. 6 (b) further shows that the model trained on the whole of ReCo performs better than the one trained only on “city_60”, which in turn performs better than the one trained on “city_40”. These observations indicate the effect of dataset size on the experiment: under the same data distribution, more training data yields better experimental results.
As shown in Fig. 6 (b), the diagonal cell has the lowest score in each row. This demonstrates that the FID score between the generated data and the corresponding ground-truth data is the lowest. Looking at the columns, the lowest scores all appear in the last row, which means that the model trained on sufficient data (ReCo) performs better than a model trained on data from a single city. Moreover, high scores are seen when comparing data from different cities, e.g., when comparing the data generated by the model trained on “city_40” with the ground-truth data of “city_60”. From these observations, we can conclude that the generated data distributions are closer to the corresponding ground-truth data and that there is a certain gap between the data distributions of different cities. This also reflects the diversity of our dataset. However, with sufficient training data, the influence of differences in the training data distribution gradually becomes negligible.
We provide a desensitized dataset, ReCo, based on coordinate and spatial attribute information, which can be flexibly exported to multiple common spatial file formats. The study also demonstrates the potential of ReCo to help researchers build models for residential community tasks. However, there is still room for improvement in this work. For instance, the Points of Interest (POI) around a community can help researchers obtain contextual information such as functionality, but such information has not been included in the dataset.
This dataset is, in a sense, a mapping of real-world communities, which means that it should be updated over time. However, since most studies are based on historical data, this does not affect research that applies the current dataset. Nonetheless, constantly updating, improving, and maintaining the dataset remains a challenge.
The ReCo Dataset can be extended in the future by collecting and adding more raw data, and can be divided into sub-datasets according to different attributes, such as geographical environment and community area. The experimental results show that there is still plenty of room to improve the planning results. This indicates that challenges remain in applying ReCo, and better models with specific designs are welcome. Our dataset currently covers only residential buildings; we would like to expand it to include other building types, e.g., commercial buildings and urban complexes, to stimulate more related work, including urban design at different scales and with mixed functions. Furthermore, ReCo can also be applied to multiple architectural design tasks, such as obtaining evaluation metrics for designs and evaluating the performance of design schemes or models.
In this paper, we introduce Residential Community Layout Planning (ReCo) Dataset, a novel scalable open-source vector dataset related to real-world communities. The current version of the dataset contains 37,646 community layout plans across 60 cities, covering 598,728 residential buildings. The building height information is also included for the extension of 2D information to 3D space. Moreover, we demonstrate the great potential of data-driven models for the automatic generation of community layouts based on our dataset. We expect that our dataset will stimulate extensive research on data-driven approaches for enabling all stages of architecture and urban design.
This work is funded in part by the National Natural Science Foundation of China Projects No. U1936213 and No. 62176185. This work is also partially supported by the Shanghai Science and Technology Development Fund No. 2021SHZDZX0100 and No. 19DZ1200802, and by the Fundamental Research Funds for the Central Universities.
Recognition of building group patterns in topographic maps based on graph partitioning and random forest. ISPRS Journal of Photogrammetry and Remote Sensing, 136:26–40, 2018.
A graph convolutional neural network for classification of building patterns using spatial vector data. ISPRS Journal of Photogrammetry and Remote Sensing, 150:259–273, 2019.
New quantitative approach for the morphological similarity analysis of urban fabrics based on a convolutional autoencoder. IEEE Access, 7:138162–138174, 2019.
For all authors…
Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? See Section 1.
Did you describe the limitations of your work? See Section 5.
Did you discuss any potential negative societal impacts of your work? We believe there are no potential negative impacts of our work.
Have you read the ethics review guidelines and ensured that your paper conforms to them? See ReCo Datasheet (one of the supplemental materials).
If you are including theoretical results…
Did you state the full set of assumptions of all theoretical results?
Did you include complete proofs of all theoretical results?
If you ran experiments (e.g. for benchmarks)…
Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? See our GitHub repository (https://github.com/FDUDSDE/ReCo-Dataset).
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Since our experiments are based on a DCGAN-like model and are only meant to demonstrate the capabilities of our dataset, we did not run the experiments several times.
Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See our GitHub repository (https://github.com/FDUDSDE/ReCo-Dataset).
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
Did you mention the license of the assets? See the ReCo Datasheet and our GitHub repository (https://github.com/FDUDSDE/ReCo-Dataset).
Did you include any new assets either in the supplemental material or as a URL? See Section 3.
Did you discuss whether and how consent was obtained from people whose data you’re using/curating? See Section 3.3.
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? See Section 3.3.
If you used crowdsourcing or conducted research with human subjects…
Did you include the full text of instructions given to participants and screenshots, if applicable?
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?