Improve SQL query performance when distinct keyword is used

Description

Queries with the distinct keyword performs poorly when large datasets shares a relatively small number of distinct values. This seems to be due to the use of ArrayList$Itr.remove(), which performs System.arrayCopy once per non distinct item (cost O). For large dataset will few distinct values, this will lead to large amounts of arrayCopies with O(n^2) complexity. Se execution time for the query "select distinct(data) from MyObject" from the supplied code example.

distinct with all values different (i.e. no duplicate values)
10000 items, 148 ms
50000 items, 269 ms
100000 items, 438 ms
200000 items, 750 ms
400000 items, 1803 ms
800000 items, 3572 ms

select distinct(data) from MyObject, distinct with all values equal (i.e. 1 distinct value)
10000 items, 14 ms
50000 items, 179 ms
100000 items, 741 ms
200000 items, 2954 ms
400000 items, 12119 ms
800000 items, 47411 ms

Same behaviour in GS 10.2 with JDK 8 and GS 14 with OpenJDK 11

Acceptance Test

Ran the provided Main

Status

Assignee

Yohana Khoury

Reporter

Axel Hjälm

Labels

Priority

Minor

SalesForce Case ID

None

Fix versions

Commitment Version/s

None

Due date

None

Product

XAP

Edition

Enterprise

Platform

Java
Configure