How do I intersect two sorted integer arrays without duplicates?

2023-09-11 01:55:53 Author: 买醉

This is an interview question that I am using as a programming exercise.

Input: Two sorted integer arrays A and B, in increasing order and of different sizes N and M, respectively.

Output: A sorted integer array C, in increasing order, that contains the elements that appear in both A and B.

Constraint: No duplicates are allowed in C.

Example: For input A = {3,6,8,9} and B = {4,5,6,9,10,11}, the output should be C = {6,9}.

Thank you all for your answers! To summarize, there are two main approaches to this problem:

My original solution was to keep two pointers, one for each array, and scan the arrays from left to right, alternating between them while picking out the elements that match. When the current element of one array is larger than the current element of the second array, we keep incrementing the pointer of the second array until we either find the current first-array element or pass it (find a larger one). I keep all matches in a separate array, which is returned once we reach the end of either input array.
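The two-pointer scan described above might look roughly like the following. This is a minimal sketch and not the original poster's code: the name `intersect_sorted_arrays` and the use of Python lists are assumptions made for illustration.

```python
def intersect_sorted_arrays(a, b):
    """Return the sorted, duplicate-free intersection of two sorted lists."""
    i, j = 0, 0          # one pointer per input array
    result = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1       # a is behind, advance it
        elif a[i] > b[j]:
            j += 1       # b is behind, advance it
        else:
            # Match: append only if it differs from the last value kept,
            # which enforces the "no duplicates in C" constraint.
            if not result or result[-1] != a[i]:
                result.append(a[i])
            i += 1
            j += 1
    return result


print(intersect_sorted_arrays([3, 6, 8, 9], [4, 5, 6, 9, 10, 11]))
```

Each pointer only moves forward, so the scan does at most N + M steps.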

Another way to do this is to scan one of the arrays linearly while using binary search to find a match in the second array. This takes O(N*log(M)) time if we scan A and, for each of its N elements, binary search B (O(log(M)) time per lookup).
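As a rough illustration of the binary search variant, here is a sketch that uses `bisect_left` from Python's standard `bisect` module. The function name `intersect_by_binary_search` is hypothetical, and the sketch assumes A is the array being scanned.

```python
from bisect import bisect_left


def intersect_by_binary_search(a, b):
    """Scan a linearly and binary search each element in sorted list b."""
    result = []
    for x in a:
        # a is sorted, so repeats of an already matched value are adjacent.
        if result and result[-1] == x:
            continue
        pos = bisect_left(b, x)            # leftmost index where x could go
        if pos < len(b) and b[pos] == x:   # x is actually present in b
            result.append(x)
    return result


print(intersect_by_binary_search([3, 6, 8, 9], [4, 5, 6, 9, 10, 11]))
```

Scanning the shorter array and searching the longer one keeps the logarithmic factor on the larger size.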

I implemented both approaches and ran an experiment to see how the two compare (details on this can be found here). The binary search method seems to win when M is roughly 70 times larger than N, for N of 1 million elements.

Recommended answer

This problem essentially reduces to a join operation followed by a filter operation (to remove duplicates and keep only the inner matches).

As both inputs are already sorted, the join can be done efficiently with a merge join, in O(size(A) + size(B)) time.

The filter operation is O(n) because the output of the join is sorted, so to remove duplicates all you have to do is check whether each element is the same as the one before it. Keeping only the inner matches is trivial: you just discard any elements that were not matched (the outer-join rows).
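To make the join-then-filter split concrete, here is a simplified sketch with the two steps as separate functions. The names `merge_join` and `drop_adjacent_duplicates` are hypothetical. The join shown here is a reduced form that emits one value each time both pointers match, rather than a full relational merge join that pairs every matching row, so duplicates can still survive into its output.

```python
def merge_join(a, b):
    """Simplified inner merge join: emit a value whenever both pointers match."""
    i = j = 0
    joined = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1               # unmatched element of a is discarded
        elif a[i] > b[j]:
            j += 1               # unmatched element of b is discarded
        else:
            joined.append(a[i])  # inner match; may repeat an earlier value
            i += 1
            j += 1
    return joined


def drop_adjacent_duplicates(sorted_values):
    """Filter step: keep a value only if it differs from the one before it."""
    filtered = []
    for v in sorted_values:
        if not filtered or filtered[-1] != v:
            filtered.append(v)
    return filtered


joined = merge_join([2, 2, 3, 6], [2, 2, 4, 6])
print(joined, drop_adjacent_duplicates(joined))
```

Because the join output is already sorted, the filter only needs to compare each element with its predecessor.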

There are opportunities for parallelism (in both the join and the filter) to achieve better performance. For example, the Apache Pig framework on Hadoop offers a parallel implementation of a merge join.

There are obvious trade-offs between performance and complexity (and thus maintainability). So I would say a good answer to an interview question really needs to take the performance demands into account.

Set-based comparison - O(n log n) - Relatively slow, very simple; use if there are no performance concerns. Simplicity wins.

Merge join + filter - O(n) - Fast, prone to coding errors; use if performance is an issue. Ideally, try to leverage an existing library to do this, or perhaps even use a database if appropriate.

Parallel implementation - O(n/p) - Very fast, requires other infrastructure to be in place; use if the volume is very large and expected to grow, and this is a major performance bottleneck.
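For the simplest of the options above, the set-based comparison, a sketch in Python could be as short as the following. The name `intersect_with_sets` is hypothetical. Python's built-in sets are hash-based, so in this particular sketch the sorting of the result is the step that contributes the logarithmic factor.

```python
def intersect_with_sets(a, b):
    """Set-based intersection; sorted() restores increasing order."""
    # Sets discard duplicates, and & keeps only values present in both.
    return sorted(set(a) & set(b))


print(intersect_with_sets([3, 6, 8, 9], [4, 5, 6, 9, 10, 11]))
```

This version ignores the fact that the inputs are already sorted, which is the price paid for its simplicity.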

(Also note that the function in the question, intersectSortedArrays, is essentially a modified merge join in which the filter is done during the join. You can filter afterwards at no performance loss, although with a slightly increased memory footprint.)

One final thought.

In fact, I suspect most modern commercial RDBMSs offer thread parallelism in their implementation of joins, so what the Hadoop version adds is machine-level parallelism (distribution). From a design point of view, perhaps a good, simple solution to the question is to put the data in a database, index A and B (effectively sorting the data), and use an SQL inner join.
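The database idea can be tried out with Python's standard `sqlite3` module and an in-memory database. This is only a sketch: the table names `a` and `b`, the column name `v`, the index names, and the function name `intersect_with_sql` are all assumptions, and SQLite is used here merely because it needs no setup, not because the answer refers to it.

```python
import sqlite3


def intersect_with_sql(a, b):
    """Load both arrays into indexed tables and intersect with an inner join."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE a (v INTEGER)")
        conn.execute("CREATE TABLE b (v INTEGER)")
        conn.executemany("INSERT INTO a (v) VALUES (?)", [(x,) for x in a])
        conn.executemany("INSERT INTO b (v) VALUES (?)", [(x,) for x in b])
        # Indexing the join column effectively keeps the data sorted.
        conn.execute("CREATE INDEX idx_a_v ON a (v)")
        conn.execute("CREATE INDEX idx_b_v ON b (v)")
        rows = conn.execute(
            "SELECT DISTINCT a.v FROM a INNER JOIN b ON a.v = b.v ORDER BY a.v"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


print(intersect_with_sql([3, 6, 8, 9], [4, 5, 6, 9, 10, 11]))
```

Here `DISTINCT` plays the role of the duplicate filter and `ORDER BY` guarantees the increasing order required for C.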