347. Top K Frequent Elements

solution

直接法

import collections

class Solution:
    def topKFrequent(self, nums: List[int], k: int) -> List[int]:
        res = dict(collections.Counter(nums))
        res = sorted(res.items(), key=lambda x: x[1], reverse=True)
        return [i[0] for i in res[-k:]]

时间复杂度：O() 空间复杂度：O()

堆
- 优化点在于选出最大的k个，不需要对全部排序。只需要前k排序。优先队列/堆
- 堆解决top k问题时，最后pop消耗空间较大，迭代过程中超过k立即pop优化空间

import collections
import heapq

class Solution:
    def topKFrequent(self, nums: List[int], k: int) -> List[int]:
        freq_dict = collections.defaultdict(int)
        for num in nums:
            freq_dict[num] += 1

        heap = []
        for num, freq in freq_dict.items():  # heap如果最小堆中按第一个元素升序顺序排
            heapq.heappush(heap, (-freq, num))

        res = []
        for _ in range(k):
            res.append(heapq.heappop(heap)[1])
        return res

时间复杂度：O(nlog(k)) 空间复杂度：O(n)

桶排序 bucket sort

class Solution:
    def topKFrequent(self, nums: List[int], k: int) -> List[int]:
        bucket = [[] for _ in range(len(nums) + 1)]
        res = []
        freq = collections.Counter(nums)

        for key, val in freq.items():
            bucket[val].append(key)  # key是num, val是出现次数. 填充桶时，次数最多的越往大的桶放

        for i in range(len(bucket) - 1, -1, -1):  # 从次数大的桶里往回找，每个桶本身是key
            res += bucket[i]
            if len(res) == k:
                return res

时间复杂度：O(n) 空间复杂度：O(n)

follow up

常见题目如Top K frequent words, Merge K sorted list, 从5亿个数里面找出不重复的数等，都可以问数据量很大，内存不够，怎么办？
小规模: 一般都可以用堆，哈希表，哈希集读到内存里面来做
大规模: map-reduce，哈希桶，bitmap（位压缩），多线程，等

有一个 node, limited memory，这样的情况可以 read chunk by chunk, 依次把每个 chunk 里的 words 排序。然后每个 chunk 分配一个指针，初始的时候指向每个 sorted chunk 里的第一个 word. 同时在内存里保留一个 size-k heap, 这样通过移动指针可以找到 total count of each word across all chunks. 然后把 (count, word) 放到 heap 里，同时保证 heap 的 size < k. 常见的错误方法是先在每个 chunk 里找 top K, 然后再找合并后的 top K，这样的话 local top K 会影响 global top K, 最后的结果可能会是错误的。
有多个 node, 每个 node limited memory, 这样就是 map reduce, 重点在于如何 partition, 有很多人会考虑 range based partition, 比如以 A 开头的 word 分到一个 node, B 开头的到另一个... 但问题在于以有些字母开头的词会更常见，所以有可能导致 unbalanced loads. 更好的方法可以是 hashing.

Previous692. Top K Frequent Words Next973. K Closest Points

Last updated 21 days ago