Semantic Search for Scala - Post 1
This is the first of a series of posts written (or well, that I will write) during my master's thesis. I'm delighted that you've taken an interest in my thesis, and if you have any blog posts, videos, or papers that I should read then contact me on Twitter (@Mads_Hartmann).
Setting the scene
Back in September 2012 I was searching for a topic for my upcoming master's thesis. After some consideration I decided I wanted to work with development tools, so I got in contact with Jonas Bonér and asked if Typesafe was interested in working with a master's thesis student. Jonas was kind enough to forward my inquiry to the Typesafe IDE team, and after a few technical interviews they agreed to collaborate with me and invited me to Lausanne, Switzerland for a week so we could kick-start the project. Since then I've been working on the project here in Copenhagen while taking a course at my university; I just passed the course and will be working full-time on my thesis for the next couple of months.
The goal of the project is to create a semantic search engine for Scala, in the form of a library, and integrate it with the Scala IDE plugin for Eclipse. Part of the solution will be to index all aspects of Scala code, that is:
- Definitions of the usual Scala elements: classes, traits, objects, methods, fields, etc.
- References to the above elements. Some of the more challenging cases to consider are self-types, type aliases, code injected by the compiler, and implicits.
With this information the library should be able to:
- Find all occurrences of any type of Scala element
- Create a call hierarchy, that is, list all in- and outgoing method invocations, for any Scala method.
- Create a type hierarchy, i.e. list all super- and subclasses of a specific type (I won't necessarily find time to implement this during my thesis, but nothing is stopping me from working on the project even after I hand in the report)
Where we’re at now
Before you get too excited I'd better warn you: I haven't written a single line of code yet. So far I've spent my time understanding how other developers have tackled the problem (JDT Java Search, the InstaSearch plugin for Eclipse, ENSIME, and Scala Refactoring), how the Scala IDE for Eclipse interacts with the Presentation Compiler, and recently I've started thinking about which phases of the compiler make the information we need accessible. This is a fairly interesting topic so I'll write a little about it here; for a more lengthy description you can read this document.
To follow along you need to be familiar with the Scala compiler's symbols and trees; if you aren't, please read the section Symbols & Trees in this document. You should also know that the IDE uses the Scala Presentation Compiler, a faster, asynchronous version of the Scala compiler that only runs a subset of the phases of the original compiler. Specifically, you should know that the presentation compiler can provide you with the parse tree or typed tree of a compilation unit.
Parse trees are created in the first phase, the parser. The parse trees are as close to the original textual representation of the source as you can get. They only diverge in a few places; for example, some language constructs such as for-comprehensions are desugared to their corresponding map, filter, and flatMap representation. A parse tree contains no information that isn't stated explicitly in the source; most importantly, it doesn't contain any symbols. Parse trees are very inexpensive to produce but, as mentioned, contain very little information. However, they have the lovely property that they can be safely used outside of the presentation compiler thread; why this is will become clear later, don't worry.
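To make the desugaring point concrete, here is a small sketch (the names and values are my own, purely for illustration) of a for-comprehension next to the map/filter/flatMap form the parser rewrites it into:

```scala
object DesugarExample {
  // A for-comprehension as written in the source.
  val pairs =
    for {
      x <- List(1, 2, 3)
      y <- List(10, 20)
      if x % 2 == 1
    } yield x * y

  // Roughly what the parse tree corresponds to after desugaring:
  // the generators become flatMap/map calls and the guard a withFilter.
  val desugared =
    List(1, 2, 3).flatMap(x =>
      List(10, 20).withFilter(_ => x % 2 == 1).map(y => x * y))
}
```

Both values evaluate to `List(10, 20, 30, 60)`, which is why a parse tree for the first form already mentions `flatMap`, `withFilter`, and `map` even though they never appear in the source text.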
Typed trees are created in the fourth phase, the typer. At this point the compiler has type-checked the source and populated the trees with symbols and types. The typed trees are obviously more expensive to produce than the parse trees.
So the question is: what kind of trees should we use to build the index? Offhand this seems like a non-issue; we obviously want the typed trees with all the juicy symbols, as they contain more information than the parse trees. However, two things get in the way of just relying on the typed trees and their symbols. First off, the typed trees are much, much more computationally expensive to produce and as such will make the indexing slower than if we used the parse trees. The second issue takes a bit of explanation.
The presentation compiler runs in its own thread with thread confinement as its synchronization policy. The typed trees that the compiler works with have attributes that are lazy; that is, the initialization of some of the attributes of the trees is delayed until they're accessed. If you access one of these lazy members outside of the presentation compiler thread you cause it to be evaluated in the current thread, and hence you've broken the synchronization policy. To avoid this you can force a computation to be executed on the presentation compiler thread.
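The pattern of forcing work onto one dedicated thread can be sketched as follows. Note that `ConfinedCompiler` and `ask` are invented names for this illustration, not the actual presentation compiler API; the real compiler exposes its own ask-style methods, but the confinement idea is the same:

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

// Sketch of the thread-confinement pattern: every computation that
// touches the (lazily initialized) compiler data structures is
// funnelled onto a single dedicated thread.
class ConfinedCompiler {
  // The one thread allowed to evaluate lazy tree attributes.
  private val compilerThread = Executors.newSingleThreadExecutor()

  // Run `op` on the confined thread and block until the result is ready.
  def ask[A](op: () => A): A =
    compilerThread.submit(new Callable[A] { def call(): A = op() }).get()

  def shutdown(): Unit = {
    compilerThread.shutdown()
    compilerThread.awaitTermination(1, TimeUnit.SECONDS)
  }
}
```

Anything run through `ask` executes on the confined thread, never the caller's, which is exactly why callers that over-use it end up queueing behind each other.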
So if we rely completely on the symbols for the indexing we're going to be blocking the presentation compiler thread, slowing down all the other components of the IDE that use it. You may think this is easily mitigated by spawning several presentation compiler instances; however, as the instances have no way of communicating with each other, they would each need to compile the project independently, which leads to a huge waste of computational power without any gain. The other extreme is to disregard the existence of symbols completely, but then we would end up re-implementing parts of the logic in the compiler, which is also far from ideal. Because of this, it is worth considering which features absolutely need access to the symbols and which don't.
Even if we have access to all the symbols and typed trees of the program, there is information that not even the symbols can provide. For example, you can't find all the trees that reference a symbol, and while the symbol of a class easily gives you all of its ancestors, it won't list its descendants. This is information we need to collect by querying the compiler's data structures and storing it in our index.
With this in mind we can start considering how to implement the features outlined above.
Find all occurrences
To implement find all occurrences we need a relation between each Scala entity and the places where it is mentioned. To build it we need a definitive way to differentiate between entities with the same name (for example, two classes with the same name but in different scopes). Making this distinction using a parse tree is non-trivial; with a typed tree, however, it's extremely simple, as we have access to the symbols. This means we would need the typed tree for every compilation unit in the project, which, as just explained, is far from ideal.
One way to avoid this is to take an approach similar to JDT's: using the parse tree, we create a primitive index that, for each compilation unit, keeps track of all the names mentioned in that document.
By using the primitive index we can filter out the compilation units that definitely don't mention the entity we're looking for, leaving us with a much smaller set of compilation units that might mention it. We then ask for a typed tree for each of the remaining compilation units and use the symbols to check for an exact match. This leaves us with the locations of the references that are guaranteed to reference the entity we searched for, without having to type-check the entire project.
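The two-stage lookup can be modeled with plain data structures. In this sketch every name, file, and map is invented for illustration: the first map plays the role of the parse-tree-based primitive index, and the second stands in for the symbol comparison we would really perform on a freshly typed tree:

```scala
object PrimitiveIndexSketch {
  type UnitName = String

  // Stage 1: a coarse index, built from cheap parse trees, recording
  // which compilation units mention which simple names.
  val nameIndex: Map[String, Set[UnitName]] = Map(
    "String" -> Set("A.scala", "B.scala", "C.scala")
  )

  def candidates(name: String): Set[UnitName] =
    nameIndex.getOrElse(name, Set.empty)

  // Stage 2 stand-in: pretend we asked the compiler for typed trees and
  // compared symbols. Here B.scala mentions the name "String" but its
  // occurrence resolves to a different symbol (say, a local class).
  val resolvesTo: Map[UnitName, Set[String]] = Map(
    "A.scala" -> Set("String"),
    "B.scala" -> Set.empty,
    "C.scala" -> Set("String")
  )

  // Only the candidate units are "type-checked" for an exact match.
  def findOccurrences(name: String): Set[UnitName] =
    candidates(name).filter(u => resolvesTo.getOrElse(u, Set.empty).contains(name))
}
```

The point of the design is visible in the numbers: stage 1 produces three candidates, stage 2 type-checks only those three and keeps two, and every other unit in the project is never touched by the typer.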
This approach should be able to find all occurrences of Scala entities, with a few edge cases listed below (I haven't considered all cases yet).
Because implicits aren't expressed in the source, the primitive index wouldn't find any files where implicits are invoked, and we're back to having to type-check the entire project to find all references. I currently don't have a solution for this, but as it is probably one of the trickiest problems I don't mind saving it for later ;)
Using type aliases it is possible to "rename" a type, for example:

```scala
type Str = String
val x: Str = "I'm a String in disguise"
```
In such a case the primitive index would not register Str as an occurrence of String, and the occurrence would go unnoticed. This could be solved by performing a search for the name of the alias whenever a type alias is created.
Call hierarchy
As mentioned, a call hierarchy lists all of the in- and outgoing calls of a method. To find the ingoing calls we simply search for all references to the method. The outgoing calls can be found by traversing the body of the method looking for Apply nodes.
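The traversal for outgoing calls can be sketched on a miniature tree type. This is not the real scalac AST (scalac's Apply nodes hold trees, not strings, and there are many more node kinds); it is a deliberately tiny model of the "walk the body, collect Apply nodes" idea:

```scala
object CallHierarchySketch {
  // A toy stand-in for compiler trees.
  sealed trait Tree
  case class Apply(fun: String, args: List[Tree]) extends Tree
  case class Block(stats: List[Tree]) extends Tree
  case class Literal(value: Any) extends Tree

  // Collect the names of every Apply node in a method body, including
  // calls nested inside the arguments of other calls.
  def outgoingCalls(body: Tree): List[String] = body match {
    case Apply(fun, args) => fun :: args.flatMap(outgoingCalls)
    case Block(stats)     => stats.flatMap(outgoingCalls)
    case Literal(_)       => Nil
  }
}
```

For a body like `Block(List(Apply("println", List(Apply("toString", List(Literal(1)))), Literal(2)))` the traversal yields both `println` and the nested `toString`, which is exactly the set of outgoing edges we want in the hierarchy.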
Type hierarchy
The type hierarchy should show all of the direct sub- and superclasses. We currently plan to let people browse through the hierarchy one layer at a time, similar to how the type hierarchy feature of Scaladoc works.
I haven't given the implementation details much consideration yet, but I believe we should be able to find all the superclasses by traversing the Def tree of the type. As for the subclasses, we could use an approach similar to find-all-references, as we're really just looking for all references to the type with the restriction that they appear in the definition of another type. That is, we would again use the primitive index to narrow down the number of compilation units. It would be even more efficient if we stored what kind of tree each occurrence was found in; that would allow us to only type-check the files where the name of the type was used in a Def tree.
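Storing the occurrence kind amounts to keying the primitive index on a (name, kind) pair instead of just the name. Again, all names, files, and kinds below are invented for illustration:

```scala
object KindedIndexSketch {
  // The kind of tree an occurrence was found in, recorded at parse time.
  sealed trait OccurrenceKind
  case object InDefTree extends OccurrenceKind // e.g. `class Sub extends Bar`
  case object InRefTree extends OccurrenceKind // e.g. `new Bar`, `x: Bar`

  // (name, kind) -> compilation units, built from parse trees.
  val index: Map[(String, OccurrenceKind), Set[String]] = Map(
    ("Bar", InDefTree) -> Set("Sub.scala"),
    ("Bar", InRefTree) -> Set("Sub.scala", "Use.scala")
  )

  // Candidates for "which types extend Bar?": only Def-tree occurrences,
  // so Use.scala never needs to be type-checked for this query.
  def subclassCandidates(name: String): Set[String] =
    index.getOrElse((name, InDefTree), Set.empty)
}
```

The extra key shrinks the candidate set before the typer is ever involved: a plain name lookup for `Bar` would return both files, while the Def-tree-restricted query returns only the one that can actually contain a subclass definition.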
In the near future
To figure out whether this approach is viable I need to go through all Scala entities and list all the places their definitions and occurrences are allowed to appear. After that it's time to consider how I will synchronize and persist the index (right now I'm leaning towards using Lucene), and then perhaps it's time to start programming a bit.