2022
Software development is largely dependent on libraries to reuse existing functionalities instead of reinventing the wheel. Software developers often need to find analogical libraries (libraries similar to ones they are already familiar with), as an analogical library may offer improved or additional features. Developers also need to search for analogical libraries across programming languages when developing applications in different languages or for different platforms. However, manually searching for analogical libraries is a time-consuming and difficult task. This paper presents a technique, called XLibRec, that recommends analogical libraries across different programming languages. XLibRec collects Stack Overflow question titles containing library names, library usage information from Stack Overflow posts, and library descriptions from a third-party website, Libraries.io. We generate word vectors for each of these information sources and calculate a weight-based cosine similarity score from them to recommend analogical libraries. We performed an extensive evaluation using a large number of analogical libraries across four different programming languages. Results from our evaluation show that the proposed technique can recommend cross-language analogical libraries with high accuracy. The precision for the Top-3 recommendations ranges from 62-81%, which is 8-45% higher than that of the state-of-the-art technique.
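The weight-based scoring at the core of this idea can be illustrated with a small sketch. The snippet below is an illustration under stated assumptions, not XLibRec's implementation: the source weights, the weighted_similarity helper, and the three per-source vectors (title, usage, description) are hypothetical names chosen to mirror the abstract.

```python
# Minimal sketch: combine per-source word vectors with a weighted cosine similarity.
# Weights and vector sources are illustrative assumptions, not XLibRec's values.
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def weighted_similarity(lib_a, lib_b, weights=(0.4, 0.3, 0.3)):
    """lib_a, lib_b: dicts holding 'title', 'usage', 'description' vectors."""
    sources = ("title", "usage", "description")
    return sum(w * cosine(lib_a[s], lib_b[s]) for w, s in zip(weights, sources))

def recommend(query, candidates, k=3):
    """Rank candidate libraries by the combined similarity to the query library."""
    scored = [(name, weighted_similarity(query, vecs)) for name, vecs in candidates.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]
```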
Multi-attribute dataset visualizations are often designed based on attribute types, i.e., whether the attributes are categorical or numerical. Parallel Sets and Parallel Coordinates are two well-known techniques to visualize categorical and numerical data, respectively. A common strategy to visualize mixed data is to use multiple information-linked views, e.g., Parallel Coordinates are often augmented with maps to explore spatial data with numeric attributes. In this paper, we design visualizations for mixed data, where the dataset may include numerical, categorical, and spatial attributes. The proposed solution, SET-STAT-MAP, is a harmonious combination of three interactive components: Parallel Sets (visualizes sets determined by the combination of categories or numeric ranges), statistics columns (visualizes numerical summaries of the sets), and a geospatial map view (visualizes the spatial information). We augment these components with colors and textures to enhance users' capability of analyzing distributions of pairs of attribute combinations. To improve scalability, we merge the sets to limit the number of possible combinations to be rendered on the display. We demonstrate the use of Set-stat-map using two different types of datasets: a meteorological dataset and an online vacation rental dataset (Airbnb). To examine the potential of the system, we collaborated with meteorologists, which revealed both challenges and opportunities for Set-stat-map to be used for real-life visual analytics.
Software engineering (SE) methodologies are widely used in both academia and industry to manage the software development life cycle. A number of studies of SE methodologies involve interviewing stakeholders to explore real-world practice. Although these interview-based studies provide us with a user's perspective of an organization's practice, they do not describe the concrete summary of releases in open-source social coding platforms. In particular, no existing study has investigated how releases evolve in open-source coding platforms, although such insight could assist release planners to a large extent. This study explores software development patterns followed in open-source projects to see the overall management's reflection on software release decisions rather than concentrating on a particular methodology. Our experiments on 51 software origins (with 1777k revisions and 12k releases) from the Software Heritage Graph Dataset (SWHGD) and their GitHub project boards (with 23k cards) reveal that reasonably active project management with phase simplicity can release software versions more frequently and can follow the small-release conventions of Extreme Programming. Additionally, the study reveals that a combination of development and management activities can be applied to predict the possible number of software releases in a month.
Source code repositories allow developers to manage multiple versions (or branches) of a software system. Pull-requests are used to modify a branch, and backporting is a regular activity used to port changes from a current development branch to other versions. In open-source software, backports are common and often need to be adapted by hand, which motivates us to explore backports and backporting challenges and strategies. In our exploration of 68,424 backports from 10 GitHub projects, we found that bug, test, document, and feature changes are commonly backported. We identified a number of backporting challenges, including that backports were inconsistently linked to their original pull-request (49%), that backports had incompatible code (13%), that backports failed to be accepted (10%), and that there were backporting delays (16 days to create, 5 days to merge). We identified some general strategies for addressing backporting issues. We also noted that backporting strategies depend on the project type and that further investigation is needed to determine their suitability. Furthermore, we created the first-ever backports dataset that can be used by other researchers and practitioners for investigating backports and backporting.
2021
Changes in spatiotemporal data may often go unnoticed due to their inherent noise and low variability (e.g., geological processes over years). Commonly used approaches such as side-by-side contour plots and spaghetti plots do not provide a clear idea about the temporal changes in such data. We propose ContourDiff, a vector-based visualization over contour plots to visualize the trends of change across spatial regions and the temporal domain. Our approach first aggregates, for each location, its value differences from the neighboring points over the temporal domain, and then creates a vector field representing the prominent changes. Finally, it overlays the vectors along the contour paths, revealing differential trends that the contour lines experienced over time. We evaluated our visualization using real-life datasets, consisting of millions of data points, where the visualizations were generated in less than a minute with single-threaded execution. Our experimental results reveal that ContourDiff can reliably visualize the differential trends, and provide a new way to explore the change pattern in spatiotemporal data.
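A rough sketch may help convey the kind of per-location aggregation described above: temporal differences at each grid cell are folded into a 2D vector that could later be drawn along contour paths. The change_vectors function and its gradient-based aggregation are illustrative assumptions, not ContourDiff's actual algorithm.

```python
# Minimal sketch (assumption): aggregate temporal differences per grid cell
# into a 2D vector field that can be overlaid on contour paths.
import numpy as np

def change_vectors(frames):
    """frames: array of shape (T, H, W), one spatial grid per time step."""
    T, H, W = frames.shape
    vectors = np.zeros((H, W, 2))
    for t in range(T - 1):
        diff = frames[t + 1] - frames[t]
        # The gradient of the temporal difference points toward increasing change.
        gy, gx = np.gradient(diff)
        vectors[..., 0] += gx
        vectors[..., 1] += gy
    return vectors / max(T - 1, 1)  # average over time steps
```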
Many organizations use legacy systems as these systems contain their valuable business rules. However, these legacy systems meet past requirements but are difficult to maintain and evolve due to the use of old technology. In this situation, stakeholders decide to renovate the system with a minimum amount of cost and risk. Although the renovation process is a more affordable choice than redevelopment, it comes with its own risks, such as performance loss and failure to achieve quality goals. A proper test process can minimize the risks associated with the renovation process. This work introduces a testing model tailored for the migration and re-engineering process and employs test automation, which results in early bug detection. Moreover, the automated tests ensure functional equivalence between the old and the new system. This process enhances the reliability, accuracy, and speed of testing.
Software architectural changes involve more than one module or component and are complex to analyze compared to local code changes. Development teams aiming to review the architectural aspects (design) of a change commit consider many essential scenarios, such as access rules and restrictions on the usage of program entities across modules. Moreover, design review is essential when proper architectural formulations are paramount for developing and deploying a system. Untangling architectural changes, recovering semantic design, and producing design notes are the crucial tasks of the design review process. To support these tasks, we construct a lightweight tool [4] that can detect and decompose semantic slices of a commit containing architectural instances. A semantic slice consists of a description of the relational information of the involved modules, their classes, methods, and connected modules in a change instance, which is easy for a reviewer to understand. We extract various directory and naming structures (DANS) properties from the source code for developing our tool. Utilizing the DANS properties, our tool first detects architectural change instances based on our defined metric and then decomposes the slices (based on string processing). Our preliminary investigation with ten open-source projects (developed in Java and Kotlin) reveals that the DANS properties produce highly reliable precision and recall (93-100%) for detecting and generating architectural slices. Our proposed tool will serve as a preliminary approach for semantic design recovery and design summary generation for project releases.
Evolutionary coupling is a well-investigated phenomenon in software maintenance research and practice. Association rules and two related measures, support and confidence, have been used to identify evolutionary coupling among program entities. However, these measures only emphasize the co-change (i.e., changing together) frequency of entities and cannot determine whether the entities co-evolved by experiencing related changes. Consequently, the approach reports false positives and fails to detect evolutionary coupling among infrequently co-changed entities. We propose a new measure, identifier correspondence (id-correspondence), that quantifies the extent to which changes that occurred to the co-changed entities are related based on identifier similarity. Identifiers are the names given to different program entities such as variables, methods, classes, packages, interfaces, structures, and unions. We use the Dice-Sørensen coefficient for measuring lexical similarity between the identifiers involved in the changed lines of the co-changed entities. Our investigation on thousands of revisions from nine subject systems covering three programming languages shows that id-correspondence can considerably improve the detection accuracy of evolutionary coupling. It outperforms the existing state-of-the-art evolutionary coupling based techniques with significantly higher recall and F-score in predicting future co-change candidates.
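The Dice-Sørensen coefficient underlying id-correspondence is straightforward to compute. Below is a minimal sketch under simple assumptions (the identifier tokenizer and the example changed lines are hypothetical), showing how identifier sets extracted from two co-changed entities could be compared.

```python
# Minimal sketch: compare identifiers from the changed lines of two co-changed
# entities using the Dice-Sørensen coefficient. Tokenization is illustrative only.
import re

def identifiers(changed_lines):
    """Extract identifier-like tokens from a list of changed source lines."""
    ids = set()
    for line in changed_lines:
        ids.update(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line))
    return ids

def dice_sorensen(a, b):
    """Dice-Sørensen coefficient of two identifier sets: 2|A∩B| / (|A|+|B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

changed_in_x = ["total = computePrice(items)", "return total"]
changed_in_y = ["price = computePrice(cartItems)"]
print(dice_sorensen(identifiers(changed_in_x), identifiers(changed_in_y)))
```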
When a programmer changes a particular code fragment, other similar code fragments in the code-base may also need to be changed together (i.e., co-changed) consistently to ensure that the software system remains consistent. Existing studies and tools apply clone detectors to identify these similar co-change candidates for a target code fragment. However, clone detectors suffer from a confounding configuration choice problem, and it affects their accuracy in retrieving co-change candidates. In our research, we propose and empirically evaluate a lightweight co-change suggestion technique that can automatically suggest fragment-level similar co-change candidates for a target code fragment using WA-DiSC (Weighted Average Dice-Sørensen Coefficient) through a context-sensitive mining of the entire code-base. We apply our technique, FLeCCS (Fragment Level Co-change Candidate Suggester), on six subject systems written in three different programming languages (Java, C, and C#) and compare its performance with the existing state-of-the-art techniques. According to our experiment, our technique outperforms not only the existing code clone based techniques but also the association rule mining based techniques in detecting co-change candidates with significantly higher accuracy (precision and recall). We also find that File Proximity Ranking performs significantly better than Similarity Extent Ranking when ranking the co-change candidates suggested by our proposed technique.
Applications of image registration tasks are computation-intensive, memory-intensive, and communication-intensive. Robust efforts are required on error recovery and re-usability of both the data and the operations, along with performance optimization. Considering these, we explore various programming models aiming to minimize the folding operations (such as join and reduce), which are the primary candidates for data shuffling, concurrency bugs, and expensive communication in a distributed cluster. Particularly, we analyze modular MapReduce execution of an image registration pipeline (IRP) with the external and internal data (data-tunneling) flow mechanisms and compare them with the compact model. Experimental analyses with the ComputeCanada cluster and a crop field dataset containing 1000 images show that these design options are valuable for large-scale IRPs executed with a MapReduce cluster. Additionally, we present an effectiveness measurement metric to analyze the impact of a design model for the Big IRP, accumulating the error-recovery and re-usability metrics along with the data size and execution time. Our explored design models and their performance analysis can serve as a benchmark for researchers and application developers who deploy large-scale image registration and other image processing tasks.
An abundant number of clone detection tools have been proposed in the literature due to the many applications and benefits of clone detection. However, there has been difficulty in the performance evaluation and comparison of these clone detectors. This is due to a lack of reliable benchmarks and the manual effort required to validate a large number of candidate clones. In particular, there has been a lack of a synthetic benchmark that can precisely and comprehensively measure clone-detection recall. In this paper, we present a mutation-analysis based benchmarking framework that can be used to evaluate the recall of clone detection tools not only for different types of clones but also for specific kinds of clone edits, without any manual effort. The framework uses an editing taxonomy of clone synthesis to generate thousands of artificial clones, injects them into code bases, and automatically evaluates the subject clone detection tools following the mutation analysis approach. Additionally, the framework allows custom clone pairs to be used for evaluating the subject tools. This provides the opportunity to evaluate specialized tools in specialized contexts, such as evaluating a tool's capability to detect complex Type-4 clones or real-world clones, without writing complex mutation operators for them. We demonstrate this framework by evaluating the performance of ten modern clone detection tools across two clone granularities (function and block) and three programming languages (Java, C and C#). Furthermore, we provide a variant of the framework that can be used to evaluate specialized tools, such as those for large-gapped clone detection. Our experiments demonstrate confidence in the accuracy of our Mutation and Injection Framework when comparing against the expected results of the corresponding tools, and against widely used real-world benchmarks such as Bellon's benchmark and BigCloneBench. We provide features so that most clone detection tools that report clones in the form of clone pairs (either in filename/line numbers or filename/tokens) can be evaluated using the framework.
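A single mutation operator gives a flavour of how artificial clones can be synthesized from an editing taxonomy. The operator below (systematic identifier renaming that yields a Type-2-style variant) is an illustrative example of the general idea, not one of the framework's actual operators.

```python
# Illustrative mutation operator (assumption): synthesize a Type-2-style clone
# of a Java-like snippet by renaming its identifiers while keeping keywords.
import re

def rename_identifiers(code, suffix="_v2"):
    """Create an artificial Type-2 variant by renaming non-keyword identifiers."""
    keywords = {"public", "private", "static", "int", "return", "void", "for", "if"}
    def repl(match):
        name = match.group(0)
        return name if name in keywords else name + suffix
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", repl, code)

original = "int sum(int a, int b) { return a + b; }"
mutant = rename_identifiers(original)  # injected next to the original as a clone pair
print(mutant)  # int sum_v2(int a_v2, int b_v2) { return a_v2 + b_v2; }
```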
2020
Not only newly proposed code clone detection techniques but also existing techniques and tools need to be evaluated and compared. This evaluation process can be done by assessing the reported clones manually or by using benchmarks. The main limitations of available benchmarks include: they are restricted to one programming language; they have a limited number of clone pairs that are confined within the selected system(s); they require manual validation; and they do not support all types of code clones. To overcome these limitations, we propose a methodology to generate a wide range of semantic clone benchmarks for different programming languages with minimal human validation. Our technique is based on the knowledge provided by developers who participate in the crowd-sourced information website Stack Overflow. We apply automatic filtering, selection, and validation to the source code in Stack Overflow answers. Finally, we build a semantic code clone benchmark of 4000 clone pairs for the languages Java, C, C# and Python.
A code clone is defined as a pair of similar code fragments within a software system. While code clones are not always harmful, they can have a detrimental effect on the overall quality of a software system due to the propagation of bugs and other maintenance implications. Because of this, software developers need to analyse the code clones that exist in a software system. However, despite the availability of several clone detection systems, the adoption of such tools outside of the clone community remains low. A possible reason for this is the difficulty and complexity involved in setting up and using these tools. In this paper, we present Clone Swarm, a code clone analytics tool that identifies clones in a project and presents the information in an easily accessible manner. Clone Swarm is publicly available and can mine any open-source Git repository. Clone Swarm internally uses NiCad, a popular clone detection tool, in the cloud and lets users interactively explore code clones using a web-based interface at multiple granularity levels (function and block level). Clone results are visualized in multiple overviews, all the way from a high-level plot down to an individual line-by-line comparison view of cloned fragments. Also, to facilitate future research in the area of clone detection and analysis, users can directly download the clone detection results for their projects. Clone Swarm is available online at clone-swarm.usask.ca. The source code for Clone Swarm is freely available under the MIT license on GitHub.
Code reuse by copying and pasting from one place to another in a codebase is a very common practice in software development and one of the most typical reasons for introducing code clones. A large number of tools are available to detect such cloned fragments, and many studies have already been done on efficient clone detection. There are also several studies that evaluate those tools considering their clone detection effectiveness. Unfortunately, we find no study that compares different clone detection tools from the perspective of detecting cloned co-change candidates during software evolution. Detecting cloned co-change candidates is essential for clone tracking. In this study, we wanted to explore this dimension of code clone research. We used six promising clone detection tools to identify cloned and non-cloned co-change candidates from six C and Java-based subject systems and evaluated the performance of those clone detection tools in detecting the cloned co-change fragments. Our findings show that a good clone detector may not perform well in detecting cloned co-change candidates. The amount of unique lines covered by a clone detector and the number of detected clone fragments play an important role in its performance. The findings of this study can enrich a new dimension of code clone research.
The fork-based development mechanism provides the flexibility and the unified processes for software teams to collaborate easily in a distributed setting without too much coordination overhead. Currently, multiple social coding platforms support fork-based development, such as GitHub, GitLab, and Bitbucket. Although these different platforms virtually share the same features, they have different emphases. As GitHub is the most popular platform and the corresponding data is publicly available, most of the current studies focus on GitHub-hosted projects. However, we observed anecdotal evidence that people are confused about choosing among these platforms, that some projects are migrating from one platform to another, and that the reasons behind these activities remain unknown. With the advances of the Software Heritage Graph Dataset (SWHGD), we have the opportunity to investigate forking activities across platforms. In this paper, we conduct an exploratory study on 10 popular open-source projects to identify cross-platform forks and investigate the motivation behind them. Preliminary results show that cross-platform forks do exist. For the 10 subject systems in this study, we found 81,357 forks in total, among which 179 forks are on GitLab. Based on our qualitative analysis, we found that most of the cross-platform forks that we identified are mirrors of the repositories on another platform, but we still find cases that were created due to a preference for certain functionalities (e.g., Continuous Integration (CI)) supported by different platforms. This study lays the foundation for future research directions, such as understanding the differences between platforms and supporting cross-platform collaboration.
Scientific workflow management systems such as Galaxy, Taverna, and Workspace have been developed to automate scientific workflow management and are increasingly being used to accelerate the specification, execution, visualization, and monitoring of data-intensive tasks. For example, the popular bioinformatics platform Galaxy is installed on over 168 servers around the world, and the social networking space myExperiment shares almost 4,000 Galaxy scientific workflows among its 10,665 members. Most of these systems offer graphical interfaces for composing workflows. However, while graphical languages are considered easier to use, graphical workflow models are more difficult to comprehend and maintain as they become larger and more complex. Text-based languages are considered harder to use but have the potential to provide a clean and concise expression of workflow even for large and complex workflows. A recent study showed that some scientists prefer script/text-based environments to perform complex scientific analysis with workflows. Unfortunately, such environments are unable to meet the needs of scientists who prefer graphical workflows. To address the needs of both types of scientists, while retaining the underlying benefits of script-based workflow models, we propose a visually guided workflow modeling framework that combines interactive graphical user interface elements in an integrated development environment with the power of a domain-specific language to compose independently developed and loosely coupled services into workflows. Our domain-specific language provides scientists with a clean, concise, and abstract view of workflow to better support workflow modeling. As a proof of concept, we developed VizSciFlow, a generalized scientific workflow management system that can be customized for use in a variety of scientific domains. As a first use case, we configured and customized VizSciFlow for the bioinformatics domain. We conducted three user studies to assess its usability, expressiveness, efficiency, and flexibility. Results are promising, and in particular, our user studies show that VizSciFlow is more desirable for users to use than either Python or Galaxy for solving complex scientific problems.
Clone detection on large code repositories is necessary for many big code analysis tasks. The goal is to provide rich information on identical and similar code across projects. Detecting near-miss code clones on big code is challenging since it requires intensive computing and memory resources as the scale of the source code increases. In this work, we propose SAGA, an efficient suffix-array based code clone detection tool designed with sophisticated GPU optimization. SAGA not only detects Type-1 and Type-2 clones but also does so for cross-project large repositories and for the most computationally expensive Type-3 clones. Meanwhile, it also works at segment granularity, which is even more challenging. It detects code clones in 100 million lines of code within 11 minutes (with recall and precision comparable to state-of-the-art approaches), which is more than 10 times faster than state-of-the-art tools. It is the only tool that efficiently detects Type-3 near-miss clones at segment granularity in large code repositories (e.g., within 11 hours on 1 billion lines of code). We conduct a preliminary case study on 85,202 GitHub Java projects with 1 billion lines of code and exhibit the distribution of clones across projects. We find about 1.23 million Type-3 clone groups, containing 28 million lines of code at arbitrary segment granularity, which are only detectable with SAGA. We believe SAGA is useful in many software engineering applications such as code provenance analysis, code completion, change impact analysis, and many more.
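The suffix-array intuition can be shown in a few lines. The sketch below is a naive, single-threaded illustration (no GPU optimization, no Type-3 or near-miss handling, and not SAGA's actual algorithm): it builds a suffix array over a token stream and reports adjacent suffixes sharing a long common prefix, i.e., repeated token sequences.

```python
# Naive sketch of the suffix-array idea: adjacent suffixes in sorted order that
# share a long common prefix correspond to repeated (cloned) token sequences.
def suffix_array(tokens):
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def common_prefix(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def repeated_sequences(tokens, min_len=3):
    sa = suffix_array(tokens)
    hits = []
    for i in range(1, len(sa)):
        lcp = common_prefix(tokens[sa[i - 1]:], tokens[sa[i]:])
        if lcp >= min_len:
            hits.append((sa[i - 1], sa[i], tokens[sa[i]:sa[i] + lcp]))
    return hits

tokens = "a b c d x a b c d y".split()
print(repeated_sequences(tokens))  # reports the repeated "a b c d" run
```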
When a programmer makes changes to a target program entity (files, classes, methods), it is important to identify which other entities might also get impacted. These entities constitute the impact set for the target entity. Association rules have been widely used for discovering impact sets. However, such rules only depend on the previous co-change history of the program entities, ignoring the fact that similar entities might often need to be updated together consistently even if they did not co-change before. Considering this fact, we investigate whether cloning relationships among program entities can be combined with association rules to help us better identify the impact sets. In our research, we particularly investigate whether the impact set detection capability of a clone detector can be utilized to enhance the capability of the state-of-the-art association rule mining technique, Tarmaq, in discovering impact sets. We use the well-known clone detector NiCad in our investigation and consider both regular and micro-clones. Our evolutionary analysis on thousands of commit operations of eight diverse subject systems reveals that consideration of code clones can enhance the impact set detection accuracy of Tarmaq with significantly higher precision and recall. Micro-clones of 3LOC and 4LOC and regular code clones of 5LOC to 20LOC contribute the most towards enhancing the detection accuracy.
Evolutionary coupling is a well-investigated phenomenon in software evolution and maintenance. If two or more program entities co-change (i.e., change together) frequently during evolution, it is expected that the entities are coupled. This type of coupling is called evolutionary coupling or change coupling in the literature. Evolutionary coupling is realized using association rules and two measures: support and confidence. Association rules have been extensively used for predicting co-change candidates for a target program entity (i.e., an entity that a programmer attempts to change). However, association rules often predict a large number of co-change candidates with many false positives. Thus, it is important to rank the predicted co-change candidates so that the true positives get higher priorities. The predicted co-change candidates have always been ranked using the support and confidence measures of the association rules. In our research, we investigate five different ranking mechanisms on thousands of commits of ten diverse subject systems. On the basis of our findings, we propose a history-based ranking approach, HistoRank (History-based Ranking), that analyzes the previous ranking history to dynamically select the most appropriate one from those five ranking mechanisms for ranking co-change candidates of a target program entity. According to our experimental results, HistoRank outperforms each individual ranking mechanism with a significantly better MAP (mean average precision). We investigate different variants of HistoRank and find that the variant that emphasizes the ranking in the most recent occurrence of co-change in the history performs the best.
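The support and confidence measures that these ranking mechanisms build on can be mined with a few lines of code. The sketch below is a simplified illustration; the mine_rules helper and the toy commit history are assumptions, and the dynamic selection among ranking mechanisms that HistoRank performs is not shown.

```python
# Minimal sketch: mine association rules (support and confidence) for a target
# entity from past commits, then rank the predicted co-change candidates.
from collections import Counter

def mine_rules(commits, target):
    """commits: list of sets of changed entities. Returns candidates with
    support = co-change count and confidence = support / changes of target."""
    target_changes = sum(1 for c in commits if target in c)
    co_change = Counter()
    for c in commits:
        if target in c:
            co_change.update(e for e in c if e != target)
    return {e: {"support": n, "confidence": n / target_changes}
            for e, n in co_change.items()} if target_changes else {}

history = [{"A.java", "B.java"}, {"A.java", "B.java", "C.java"}, {"C.java"}]
rules = mine_rules(history, "A.java")
# One possible ranking: by confidence, then support.
ranked = sorted(rules, key=lambda e: (rules[e]["confidence"], rules[e]["support"]), reverse=True)
print(ranked)
```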
Developers often prefer dynamically typed programming languages, such as JavaScript, because such languages do not require explicit type declarations. However, this feature hinders software engineering tasks such as code completion and type-related bug fixes. Deep learning-based techniques have been proposed in the literature to infer the types of code elements in JavaScript snippets, but these techniques are computationally expensive. While several type inference techniques have been developed to detect types in code snippets written in statically typed languages, it is not clear how effective those techniques are for inferring types in dynamically typed languages, such as JavaScript. In this paper, we investigate the type inference techniques of JavaScript to understand the above two issues further. While doing that, we propose a new technique that considers locally specific code tokens as the context to infer the types of code elements. The evaluation result shows that the proposed technique is 20-47% more accurate than the statically typed language-based techniques and 5-14 times faster than the deep learning techniques without sacrificing accuracy. Our analysis of sensitivity, overlapping of predicted types, and the number of training examples justifies the importance of our technique.
Code clones are the same or nearly similar code fragments in a software system's code-base. While existing studies have extensively investigated regular code clones in software systems, micro-clones have been mostly ignored. Although an existing study investigated consistent changes in exact micro-clones, near-miss micro-clones have never been investigated. In our study, we investigate the importance of near-miss micro-clones in software evolution and maintenance by automatically detecting and analyzing the consistent updates that they experienced during the whole period of evolution of our subject systems. We compare the consistent co-change tendency of near-miss micro-clones with that of exact micro-clones and regular code clones. According to our investigation on thousands of revisions of six open-source subject systems written in two different programming languages, near-miss micro-clones have a significantly higher tendency of experiencing consistent updates compared to exact micro-clones and regular (both exact and near-miss) code clones. Consistent updates in near-miss micro-clones have a high tendency of being related to bug-fixes. Moreover, the percentage of commit operations where near-miss micro-clones experience consistent updates is considerably higher than that of regular clones and exact micro-clones. We finally observe that near-miss micro-clones staying in close proximity to each other have a high tendency of experiencing consistent updates. Our research implies that near-miss micro-clones should be considered as important as regular clones and exact micro-clones when making clone management decisions.
Code clones, identical or nearly similar code fragments in a software system's code-base, have mixed impacts on software evolution and maintenance. Focusing on the issues of clones, researchers suggest managing them through refactoring and tracking. In this paper, we present a survey on the state-of-the-art of clone refactoring and tracking techniques, and identify future research possibilities in these areas. We define the quality assessment features for clone refactoring and tracking tools, and make a comparison among these tools considering these features. To the best of our knowledge, our survey is the first comprehensive study on clone refactoring and tracking. According to our survey on clone refactoring, we realize that automatic refactoring cannot eliminate the necessity of manual effort regarding finding refactoring opportunities and post-refactoring testing of system behaviour. Post-refactoring testing can require a significant amount of time and effort from quality assurance engineers. There is a marked lack of research on the effect of clone refactoring on system performance. Future investigations in this direction will add much value to clone refactoring research. We also feel the necessity of future research towards real-time detection and tracking of code clones in a big-data environment.
While there are novel approaches for detecting and categorizing similar software applications, previous research focused on detecting similarity in applications written in the same programming language rather than in applications written in different programming languages. Cross-language software similarity detection is inherently more challenging due to variations in language, application structures, support libraries used, and naming conventions. In this paper, we propose a novel model, CroLSim, to detect similar software applications across different programming languages. We define a semantic relationship among cross-language libraries and API methods (both local and third party) using functional descriptions and a word-vector learning model. Our experiments show that CroLSim can successfully detect cross-language similar software applications and outperforms all existing approaches (mean average precision rate of 0.65, confidence rate of 3.6, and 75% highly rated successful queries). Furthermore, we applied CroLSim to a source code repository to see whether our model can recommend cross-language source code fragments if queried directly with source code. From our experiments, we found that CroLSim can recommend cross-language functionally similar source code when source code is directly used as a query (average precision=0.28, recall=0.85, and F-measure=0.40).
Context: APIs play a central role in software development. The seminal research of Carroll et al. [15] on the minimal manual and subsequent studies by Shull et al. [79] showed that developers prefer task-based API documentation over traditional hierarchical official documentation (e.g., Javadoc). The Q&A format in Stack Overflow offers developers an interface to ask and answer questions related to their development tasks. Objective: With a view to producing API documentation, we study automated techniques to mine API usage scenarios from Stack Overflow. Method: We propose a framework to mine API usage scenarios from Stack Overflow. Each task consists of a code example, the task description, and the reactions of developers towards the code example. First, we present an algorithm to automatically link a code example in a forum post to an API mentioned in the textual contents of the forum post. Second, we generate a natural language description of the task by summarizing the discussions around the code example. Third, we automatically associate developers' reactions (i.e., positive and negative opinions) towards the code example to offer information about code quality. Results: We evaluate the algorithms using three benchmarks. We compared the algorithms against seven baselines, and our algorithms outperformed each baseline. We developed an online tool by automatically mining API usage scenarios from Stack Overflow. A user study of 31 software developers shows that the participants preferred the mined usage scenarios in Opiner over the official API documentation. The tool is available online at http://opiner.polymtl.ca/. Conclusion: With a view to producing API documentation, we propose a framework to automatically mine API usage scenarios from Stack Overflow, supported by three novel algorithms. We evaluated the algorithms against a total of eight state-of-the-art baselines. We implement and deploy the framework in our proof-of-concept online tool, Opiner.
Detecting large-variance code clones (i.e., clones with relatively more differences) in large-scale code repositories is difficult because most current tools can only detect almost identical or very similar clones. Detecting such clones would benefit software applications such as bug detection, code completion, and software analysis. Recently, CCAligner made an attempt to detect clones with relatively concentrated modifications, called large-gap clones. Our contribution is a novel and effective approach that detects large-variance clones in more general cases, covering not only concentrated but also scattered code modifications. We propose a detector named LVMapper, borrowing and adapting the sequence alignment approach from bioinformatics, which can find two similar sequences with more differences. The ability of LVMapper was tested on both self-synthetic datasets and real cases, and the results show substantial improvement in detecting large-variance clones compared with other state-of-the-art tools including CCAligner. Furthermore, our new tool also presents good recall and precision for general Type-1, Type-2 and Type-3 clones on the widely used benchmarking dataset, BigCloneBench.
Developers often search for relevant code examples on the web for their programming tasks. Unfortunately, they face three major problems. First, they frequently need to read and analyse multiple results from the search engines to obtain a satisfactory solution. Second, the search is impaired due to a lexical gap between the query (task description) and the information associated with the solution (e.g., code example). Third, the retrieved solution may not be comprehensible, i.e., the code segment might miss a succinct explanation. To address these three problems, we propose CROKAGE (Crowd Knowledge Answer Generator), a tool that takes the description of a programming task (the query) as input and delivers a comprehensible solution for the task. Our solutions contain not only relevant code examples but also their succinct explanations written by human developers. The search for code examples is modeled as an Information Retrieval (IR) problem. We first leverage the crowd knowledge stored in Stack Overflow to retrieve the candidate answers against a programming task. For this, we use a fine-tuned IR technique, chosen after comparing 11 IR techniques in terms of performance. Then we use a multi-factor relevance mechanism to mitigate the lexical gap problem and select the top quality answers related to the task. Finally, we perform natural language processing on the top quality answers and deliver comprehensible solutions containing both code examples and code explanations, unlike earlier studies. We evaluate and compare our approach against ten baselines, including the state-of-the-art. We show that CROKAGE outperforms the ten baselines in suggesting relevant solutions for 902 programming tasks (i.e., queries) of three popular programming languages: Java, Python and PHP. Furthermore, we use 24 programming tasks (queries) to evaluate our solutions with 29 developers and confirm that CROKAGE outperforms the state-of-the-art tool in terms of the relevance of the suggested code examples, the benefit of the code explanations, and the overall solution quality (code + explanation).
A code clone is a pair of similar code fragments within or between software systems. Since code clones often negatively impact the maintainability of a software system, several code clone detection techniques and tools have been proposed and studied over the last decade. However, clone detection tools are not always perfect, and their clone detection reports often contain a number of false positives or clones that are irrelevant from a specific project management or user perspective. To detect all possible similar source code patterns in general, the clone detection tools work at the syntax level while lacking user-specific preferences. This often means the clones must be manually inspected before analysis in order to remove those false positives from consideration. This manual clone validation effort is very time-consuming and often error-prone, in particular for large-scale clone detection. In this paper, we propose a machine learning approach for automating the validation process. First, a training dataset is built by taking code clones from several clone detection tools for different subject systems and then manually validating those clones. Second, several features are extracted from those clones to train the machine learning model of the proposed approach. The trained algorithm is then used to automatically validate clones without human inspection. Thus, the proposed approach can be used to remove false positive clones from detection results, automatically evaluate the precision of any clone detector for any given dataset, evaluate existing clone benchmark datasets, or even build new clone benchmarks and datasets with minimum effort. In an experiment with clones detected by several clone detectors in several different software systems, we found that our approach has an accuracy of up to 87.4% when compared against manual validation by multiple expert judges. The proposed method also shows better results in several comparative studies with the existing related approaches for clone classification.
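The overall idea of learning a clone validator can be sketched briefly. The features and model below are illustrative assumptions only (the study uses a richer feature set and compares several learners); the sketch merely shows how manually validated clone pairs could train a classifier that filters false positives.

```python
# Minimal sketch (assumption): train a classifier on simple clone-pair features
# derived from manually validated labels, then use it to filter false positives.
from sklearn.ensemble import RandomForestClassifier

def features(fragment_a, fragment_b):
    """Illustrative features only: fragment sizes, size ratio, and token overlap."""
    tokens_a, tokens_b = set(fragment_a.split()), set(fragment_b.split())
    overlap = len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)
    return [len(fragment_a), len(fragment_b),
            min(len(fragment_a), len(fragment_b)) / max(len(fragment_a), len(fragment_b), 1),
            overlap]

# X: features of reported clone pairs; y: 1 = true clone, 0 = false positive (manual labels)
X = [features("int a = b + c ;", "int x = y + z ;"),
     features("return list . size ( ) ;", "while ( i < n ) i ++ ;")]
y = [1, 0]
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(model.predict([features("int p = q + r ;", "int u = v + w ;")]))
```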
2019
The design and maintenance of APIs (Application Programming Interfaces) are complex tasks due to the constantly changing requirements of their users. Despite the efforts of their designers, APIs may suffer from a number of issues (such as incomplete or erroneous documentation, poor performance, and backward incompatibility). To maintain a healthy client base, API designers must learn about these issues in order to fix them. Question answering sites, such as Stack Overflow (SO), have become a popular place for discussing API issues. These posts about API issues are invaluable to API designers, not only because they can help to learn more about the problem but also because they can facilitate learning the requirements of API users. However, the unstructured nature of posts and the abundance of non-issue posts make the task of detecting SO posts concerning API issues difficult and challenging. In this paper, we first develop a supervised learning approach using a Conditional Random Field (CRF), a statistical modeling method, to identify API issue-related sentences. We use the above information together with different features collected from posts, the experience of users, readability metrics, and centrality measures of the collaboration network to build a technique, called CAPS, that can classify SO posts concerning API issues. In total, we consider 34 features along eight different dimensions. Evaluation of CAPS using carefully curated SO posts on three popular API types reveals that the technique outperforms all three baseline approaches we consider in this study. We then conduct studies to find important features and also evaluate the performance of the CRF-based technique for classifying issue sentences. Comparison with two other baseline approaches shows that the technique has high potential. We also test the generalizability of CAPS results, evaluate the effectiveness of different classifiers, and identify the impact of different feature sets.
Copying and pasting source code during software development is known as code cloning. Clone fragments with a minimum size of 5 LOC were usually considered in previous studies. In recent studies, clone fragments of fewer than 5 LOC are referred to as micro-clones. It has been established in the literature that code clones are closely related to software bugs as well as bug replication. However, no previous study has investigated bug-replication in micro-clones. In this paper, we investigate and compare bug-replication between regular and micro-clones. For the purpose of our investigation, we analyze the evolutionary history of our subject systems and identify occurrences of similarity preserving co-changes (SPCOs) in both regular and micro-clones where they experienced bug-fixes. From our experiment on thousands of revisions of six diverse subject systems written in three different programming languages, C, C# and Java, we find that the percentage of clone fragments that take part in bug-replication is often higher in micro-clones than in regular code clones. The percentage of bugs that get replicated in micro-clones is almost the same as the percentage in regular clones. Finally, both regular and micro-clones have similar tendencies of replicating severe bugs according to our experiment. Thus, micro-clones in a code-base should not be ignored. We should rather consider them as important as regular clones when making clone management decisions.
Developers often search for relevant code examples on the web for their programming tasks. Unfortunately, they face two major problems. First, the search is impaired due to a lexical gap between their query (task description) and the information associated with the solution. Second, the retrieved solution may not be comprehensible, i.e., the code segment might miss a succinct explanation. These problems make developers browse dozens of documents in order to synthesize an appropriate solution. To address these two problems, we propose CROKAGE (Crowd Knowledge Answer Generator), a tool that takes the description of a programming task (the query) and provides a comprehensive solution for the task. Our solutions contain not only relevant code examples but also their succinct explanations. Our proposed approach expands the task description with relevant API classes from Stack Overflow Q&A threads and thereby mitigates the lexical gap problem. Furthermore, we perform natural language processing on the top quality answers and return programming solutions containing both code examples and code explanations, unlike earlier studies. We evaluate our approach using 97 programming queries, of which 50% were used for training and 50% for testing, and show that it outperforms six baselines, including the state-of-the-art, by a statistically significant margin. Furthermore, our evaluation with 29 developers using 24 tasks (queries) confirms the superiority of CROKAGE over the state-of-the-art tool in terms of the relevance of the suggested code examples, the benefit of the code explanations, and the overall solution quality (code + explanation).
In modern days, mobile applications (apps) have become omnipresent. Components of mobile apps (such as third-party libraries) need to be separated and analyzed differently for security issue detection, repackaged app detection, tumor code purification, and so on. Various techniques are available to automatically analyze mobile apps. However, analysis of an app's executable binary remains challenging due to the required curated databases, large codebases, and obfuscation. Considering these, we focus on exploring a versatile technique to separate different components with design-based features that are independent of code obfuscation. Particularly, we conducted an empirical study using design patterns and fuzzy signatures to separate app components such as third-party libraries. In doing so, we built a system for automatically extracting design patterns from both the executable package (APK) and the Jar of an Android application. The experimental results with various standard datasets containing third-party libraries, obfuscated apps and malware reveal that such design features are present significantly within them (in 60% of APKs, including malware). Moreover, these features remain unaltered even after app obfuscation. Finally, as a case study, we found that the design patterns alone can detect third-party libraries within obfuscated apps to a considerable extent (F1 score of 32%). Overall, our empirical study reveals that design features might play a versatile role in separating various Android components for various purposes.
With the era of big data approaching, the number of software systems and their dependencies, as well as the complexity of individual systems, are becoming larger and more intricate. Understanding these evolving software systems is thus a primary challenge for cost-effective software management and maintenance. In this paper, we perform a case study with evolving code clones. Programmers often need to manually analyze the co-evolution of clone fragments to decide about refactoring, tracking, and bug removal. However, manual analysis is time consuming, and nearly infeasible for a large number of clones, e.g., with millions of similarity pairs, where clones are evolving over hundreds of software revisions. We propose an interactive visual analytics system, Clone-World, which leverages a big data visualization approach to manage code clones in large software systems. Clone-World gives an intuitive yet powerful solution to clone analytic problems. Clone-World combines multiple information-linked zoomable views, where users can explore and analyze clones through interactive exploration in real time. User studies and experts' reviews suggest that Clone-World may assist developers in many real-life software development and maintenance scenarios. We believe that Clone-World will ease the management and maintenance of clones, and inspire future innovation to adapt visual analytics to manage big software systems.
Code clones are identical or nearly similar code fragments in a code-base. According to existing studies, code clones are directly related to bugs. Code cloning, creating code clones, is suspected to propagate temporarily hidden bugs from one code fragment to another. However, there is no study on the intensity of bug-propagation through code cloning. In this paper, we define two clone evolutionary patterns that reasonably indicate bug propagation through code cloning. By analyzing software evolution history, we identify those code clones that evolved following the bug propagation patterns. According to our study on thousands of commits of seven subject systems, overall 18.42% of the clone fragments that experience bug-fixes contain propagated bugs. Type-3 clones are primarily involved with bug-propagation. Bug propagation is more likely to occur in clone fragments that are created in the same commit rather than in different commits. Moreover, code clones residing in the same file have a higher possibility of containing propagated bugs compared to those residing in different files. Severe bugs can sometimes get propagated through code cloning. Automatic support for immediately identifying occurrences of bug-propagation can be beneficial for software maintenance. Our findings are important for prioritizing code clones for management.
The identical or nearly similar code fragments in a code-base are called code clones. There is a common belief that code cloning (copy/pasting code fragments) can introduce bugs in a software system if the copied code fragments are not properly adapted to their contexts (i.e., surrounding code). However, none of the existing studies have investigated whether such bugs are really present in code clones. We denote these bugs as Context Adaptation Bugs, or simply Context-Bugs, in our paper and investigate the extent to which they can be present in code clones. We define and automatically analyze two clone evolutionary patterns that indicate fixing of Context-Bugs. According to our analysis on thousands of revisions of six open-source subject systems written in Java, C, and C#, code cloning often introduces Context-Bugs in software systems. Around 50% of the clone related bug-fixes can occur for fixing Context-Bugs. Cloning (copy/pasting) a newly created code fragment (i.e., a code fragment that was not added in a former revision) is more likely to introduce Context-Bugs compared to cloning a preexisting fragment (i.e., a code fragment that was added in a former revision). Moreover, cloning across different files appears to have a significantly higher tendency of introducing Context-Bugs compared to cloning within the same file. Finally, Type 3 clones (gapped clones) have the highest tendency of containing Context-Bugs among the three major clone-types. Our findings can be important for early detection as well as removal of Context-Bugs in code clones.
Scientific Workflow Management Systems (SWfMSs) have become popular for accelerating the specification, execution, visualization, and monitoring of data-intensive scientific experiments. Unfortunately, to the best of our knowledge, no existing SWfMS directly supports collaboration. Data is increasing in complexity, dimensionality, and volume, and the efficient analysis of data often goes beyond the realm of an individual and requires collaboration with multiple researchers from varying domains. In this paper, we propose a groupware system architecture for data analysis that, in addition to supporting collaboration, also incorporates features from SWfMSs to support modern data analysis processes. As a proof of concept for the proposed architecture, we developed SciWorCS, a groupware system for scientific data analysis. We present two real-world use cases: collaborative software repository analysis and bioinformatics data analysis. The results of the experiments evaluating the proposed system are promising. Our bioinformatics user study demonstrates that SciWorCS can support real-world data analysis tasks by enabling real-time collaboration among users.
A code clone is a pair of similar code fragments, within or between software systems. To detect each possible clone pair from a software system while handling complex code structures, clone detection tools apply a great deal of generalization to the original source code. This generalization often results in returning code fragments that are only coincidentally similar and are not considered clones by users. Hence, the reported candidate clones require manual validation by users, which is often both time-consuming and challenging. In this paper, we propose a machine learning based tool, 'CloneCognition' (open source code: https://github.com/pseudoPixels/CloneCognition ; video demonstration: https://www.youtube.com/watch?v=KYQjmdr8rsw), to automate the laborious manual validation process. The tool runs on top of any code clone detection tool to facilitate the clone validation process. The tool shows promising clone classification performance with an accuracy of up to 87.4%. The tool also exhibits significant improvement in the results when compared with state-of-the-art techniques for code clone validation.
Software clones are detrimental to software maintenance and evolution, and as a result many clone detectors have been proposed. These tools target clone detection in software applications written in a single programming language. However, a software application may be written in different languages for different platforms to improve the application's platform compatibility and adoption by users of different platforms. Cross-language clones (CLCs) introduce additional challenges when maintaining multi-platform applications and would likely go undetected using existing tools. In this paper, we propose CLCDSA, a cross-language clone detector which can detect CLCs without extensive processing of the source code and without the need to generate an intermediate representation. The proposed CLCDSA model analyzes different syntactic features of source code across different programming languages to detect CLCs. To support large scale clone detection, the CLCDSA model uses an action filter based on cross-language API call similarity to discard non-potential clones. The design methodology of CLCDSA is two-fold: (a) it detects CLCs on the fly by comparing the similarity of features, and (b) it uses a deep neural network based feature vector learning model to learn the features and detect CLCs. Early evaluation of the model shows an average precision, recall, and F-measure of 0.55, 0.86, and 0.64 respectively for the first phase and 0.61, 0.93, and 0.71 respectively for the second phase, which indicates that CLCDSA outperforms all available models in detecting cross-language clones.
Developers often reuse code snippets from online forums, such as Stack Overflow, to learn API usages of software frameworks or libraries. These code snippets often contain ambiguous undeclared external references. Such external references make it difficult to learn and use those APIs correctly. In particular, reusing code snippets containing such ambiguous undeclared external references requires significant manual effort and expertise to resolve them. Manually resolving fully qualified names (FQNs) of API elements is a non-trivial task. In this paper, we propose a novel context-sensitive technique, called COSTER, to resolve FQNs of API elements in such code snippets. The proposed technique collects locally specific source code elements as well as globally related tokens as the context of FQNs, calculates likelihood scores, and builds an occurrence likelihood dictionary (OLD). Given an API element as a query, COSTER captures the context of the query API element, matches that with the FQNs of API elements stored in the OLD, and ranks those matched FQNs leveraging three different scores: likelihood, context similarity, and name similarity. Evaluation with more than 600K code examples collected from GitHub and two different Stack Overflow datasets shows that our proposed technique improves precision by 4-6% and recall by 3-22% compared to state-of-the-art techniques. The proposed technique significantly reduces the training time compared to StatType, a state-of-the-art technique, without sacrificing accuracy. Extensive analyses of the results demonstrate the robustness of the proposed technique.
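The three-score ranking described above can be sketched as follows. The weights, the rank_fqns helper, and the toy occurrence likelihood dictionary are illustrative assumptions rather than COSTER's actual scoring functions.

```python
# Minimal sketch (assumption): rank candidate fully qualified names by combining
# a likelihood score from an occurrence dictionary with context and name similarity.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_fqns(query_name, query_context, old, weights=(0.4, 0.4, 0.2)):
    """old: {fqn: {'likelihood': float, 'context': [tokens]}} built offline."""
    w_like, w_ctx, w_name = weights
    scored = []
    for fqn, entry in old.items():
        name_sim = 1.0 if fqn.endswith("." + query_name) else 0.0
        score = (w_like * entry["likelihood"]
                 + w_ctx * jaccard(query_context, entry["context"])
                 + w_name * name_sim)
        scored.append((fqn, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)

old = {"java.util.List": {"likelihood": 0.9, "context": ["add", "size", "ArrayList"]},
       "java.awt.List": {"likelihood": 0.2, "context": ["Frame", "Component"]}}
print(rank_fqns("List", ["ArrayList", "add"], old))
```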
A scientific workflow management system (SWFMS) is one of the inherent parts of Big Data analytics systems. Analyses in such data-intensive research using workflows are very costly. SWFMSs or workflows keep track of every bit of execution through logs, which can later be used on demand. For example, in the case of errors, security breaches, or other conditions, we may need to trace back to previous steps or examine intermediate data elements. This fashion of logging is known as workflow provenance. However, since prominent workflows are domain specific and developed following different programming paradigms, their architectures, logging mechanisms, information in the logs, provenance queries, and so on differ significantly. Consequently, the provenance technology of one workflow from a certain domain is not easily applicable to another domain. Facing the lack of a general workflow provenance standard, we propose a programming model for automated workflow logging. The programming model is easy to implement and easily configurable by domain experts independent of workflow users. We implement our workflow programming model on bioinformatics research for evaluation and collect workflow logs from the executions of various scientific pipelines. We then focus on some fundamental provenance questions, inspired by recent literature, from which many other complex provenance questions can be derived. Finally, the end users are provided with insights discovered from the workflow provenance through online data visualization as a separate web service.
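As a rough illustration of automated workflow logging, the following Python sketch wraps a pipeline module in a decorator that records its inputs, output, and timing; the log format, field names, and the example bioinformatics step are illustrative assumptions rather than the proposed programming model itself.

# Sketch of provenance logging for workflow modules via a decorator.
import functools, json, time, uuid

PROVENANCE_LOG = "provenance.log"  # hypothetical log destination

def provenance(module_name):
    """Record inputs, output, and duration of a workflow module execution."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            record = {
                "run_id": str(uuid.uuid4()),
                "module": module_name,
                "inputs": [repr(a) for a in args],
                "output": repr(result),
                "duration_s": round(time.time() - start, 4),
            }
            with open(PROVENANCE_LOG, "a") as log:
                log.write(json.dumps(record) + "\n")
            return result
        return inner
    return wrap

@provenance("quality_filter")  # hypothetical pipeline step
def quality_filter(reads, min_score=30):
    return [r for r in reads if r["score"] >= min_score]

quality_filter([{"seq": "ACGT", "score": 35}, {"seq": "TTGA", "score": 12}])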
Big Data analytics systems, i.e., systems developed with parallel distributed processing frameworks (e.g., Hadoop and Spark), are becoming popular for finding important insights from huge amounts of heterogeneous data (e.g., image, text, and sensor data). These systems offer a wide range of tools and connect them to form workflows for processing Big Data. Independent schemes for managing the programs and data of workflows have already been proposed by many researchers, and most such systems have been presented with data or metadata management. However, to the best of our knowledge, no study particularly discusses the performance implications of utilizing the intermediate states of data and programs generated at various execution steps of a workflow in distributed platforms. To address this shortcoming, we propose a Big Data management scheme for micro-level modular computation-intensive programs in a Spark- and Hadoop-based platform. In this paper, we investigate whether management of the intermediate states can speed up the execution of an image processing pipeline consisting of various image processing tools/APIs in the Hadoop Distributed File System (HDFS) while ensuring appropriate reusability and error monitoring. Our experiments yielded promising results; for example, with intermediate data management we can save up to 87% of the computation time for an image processing job.
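The core idea of reusing intermediate states can be sketched as follows; local files stand in for HDFS here, and the step names, hashing scheme, and toy image operations are illustrative assumptions rather than the scheme described in the paper.

# Sketch of intermediate-state reuse in a modular pipeline: each step's output is
# keyed by a digest of the step name and its input, and reused if already stored.
import hashlib, os, pickle

CACHE_DIR = "intermediate_states"
os.makedirs(CACHE_DIR, exist_ok=True)

def state_key(step_name, input_bytes):
    """Identify an intermediate state by the step and a digest of its input."""
    return hashlib.sha256(step_name.encode() + input_bytes).hexdigest()

def run_step(step_name, func, data):
    path = os.path.join(CACHE_DIR, state_key(step_name, pickle.dumps(data)))
    if os.path.exists(path):               # reuse a previously computed state
        with open(path, "rb") as f:
            return pickle.load(f)
    result = func(data)                    # otherwise compute and store it
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Hypothetical image-processing-style steps operating on a matrix of RGB pixels.
grayscale = lambda img: [[sum(px) // 3 for px in row] for row in img]
threshold = lambda img: [[1 if v > 127 else 0 for v in row] for row in img]

image = [[(200, 180, 190), (10, 20, 30)]]
step1 = run_step("grayscale", grayscale, image)
step2 = run_step("threshold", threshold, step1)   # served from the cache on repeated runs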
2018
Workflows are frequently built and used to systematically process large datasets using workflow management systems (WMS). A workflow (i.e., a pipeline) is a finite set of processing modules organized as a series of steps that is applied to an input dataset to produce a desired output. In a workflow management system, users generally create workflows manually for their own investigations. However, workflows can sometimes be lengthy and the constituent processing modules can be computationally expensive. In this situation, it would be beneficial if users could reuse intermediate stage results generated by previously executed workflows when executing their current workflow. In this paper, we propose a novel technique based on association rule mining for suggesting which intermediate stage results of a workflow that a user is about to execute should be stored for future reuse. We call our proposed technique RISP (Recommending Intermediate States from Pipelines). According to our investigation of hundreds of workflows from two scientific workflow management systems, our proposed technique can efficiently suggest intermediate state results to store for future reuse. The results that are suggested to be stored have a high reuse frequency. Moreover, for creating around 51% of the entire set of pipelines, we can reuse results suggested by our technique. Finally, we can achieve a considerable gain (74%) in execution time by reusing intermediate results stored according to the suggestions provided by our technique. We believe that RISP has the potential to have a significant positive impact on Big Data systems, because it can considerably reduce the execution time of workflows through appropriate reuse of intermediate state results, and hence can improve the performance of such systems.
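The intuition behind the suggestion step can be sketched with simple frequency counting over pipeline prefixes; the support threshold, example pipelines, and module names below are illustrative assumptions, and RISP itself uses full association rule mining rather than this simplified counting.

# Sketch: a pipeline prefix that recurs across many pipelines is a candidate
# intermediate state worth storing for future reuse.
from collections import Counter

pipelines = [
    ["load", "clean", "normalize", "cluster"],
    ["load", "clean", "normalize", "classify"],
    ["load", "clean", "visualize"],
]

def prefix_support(pipelines):
    """Count how often each pipeline prefix (a candidate intermediate state) occurs."""
    counts = Counter()
    for p in pipelines:
        for i in range(1, len(p)):
            counts[tuple(p[:i])] += 1
    return counts

def suggest_states(pipelines, min_support=2):
    return [prefix for prefix, c in prefix_support(pipelines).items() if c >= min_support]

# Prefixes occurring in at least two pipelines (e.g., load->clean->normalize) are suggested.
print(suggest_states(pipelines))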
If two or more program entities (such as files, classes, or methods) co-change (i.e., change together) frequently during software evolution, then it is likely that these entities are coupled (i.e., related). Such a coupling is termed evolutionary coupling in the literature. The concept of traditional evolutionary coupling restricts us to assuming coupling only among entities that changed together in the past. Entities that did not co-change in the past might also be coupled. However, such couplings cannot be retrieved using the current concept of evolutionary coupling in the literature. In this paper, we investigate whether we can detect such couplings by applying transitive rules to the evolutionary couplings detected using the traditional mechanism. We call the couplings detected using our proposed mechanism transitive evolutionary couplings. According to our research on thousands of revisions of four subject systems, transitive evolutionary couplings combined with the traditional ones provide 13.96% higher recall and 5.56% higher precision in detecting future co-change candidates when compared with a state-of-the-art technique.
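The transitive rule can be sketched as follows; real evolutionary couplings carry support and confidence values, whereas this illustration treats them as unweighted pairs with hypothetical file names.

# Sketch: if A couples with B and B couples with C, infer a candidate coupling A-C.
from itertools import combinations

# Entities that changed together in past commits (traditional couplings).
co_change_pairs = {("A.java", "B.java"), ("B.java", "C.java"), ("C.java", "D.java")}

def transitive_couplings(pairs):
    neighbours = {}
    for x, y in pairs:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    inferred = set()
    for mid, ends in neighbours.items():
        for a, b in combinations(sorted(ends), 2):
            if (a, b) not in pairs and (b, a) not in pairs:
                inferred.add((a, b))
    return inferred

# A-C and B-D never co-changed, but are inferred as transitive coupling candidates.
print(transitive_couplings(co_change_pairs))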
A code clone is a pair of similar code fragments within or between software systems. Since code clones often negatively impact the maintainability of a software system, a great number of code clone detection techniques and tools have been proposed and studied over the last decade. To detect all possible similar source code patterns in general, clone detection tools work at the syntax level (e.g., text, tokens, or ASTs) without accounting for user-specific preferences. This often means the reported clones must be manually validated prior to any analysis in order to identify the true positive clones under task- or user-specific considerations. This manual clone validation effort is very time-consuming and often error-prone, in particular for large-scale clone detection. In this paper, we propose a machine learning based approach for automating the validation process. In an experiment with clones detected by several clone detectors in several different software systems, we found our approach has an accuracy of up to 87.4% when compared against the manual validation by multiple expert judges. The proposed method shows promising results in several comparative studies with existing related approaches for automatic code clone validation. We also present our experimental results in terms of different code clone detection tools, machine learning algorithms, and open source software systems.
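A minimal sketch of the learning setup follows; the three similarity features, the toy labels, and the use of a scikit-learn random forest are illustrative assumptions, not the exact feature set or model reported in the paper.

# Sketch: learn to predict whether a reported clone pair would pass manual validation.
from sklearn.ensemble import RandomForestClassifier

# Each reported clone pair -> [token similarity, line-count ratio, identifier overlap]
X_train = [
    [0.95, 0.98, 0.90],   # judged a true clone
    [0.40, 0.50, 0.20],   # judged a false positive
    [0.88, 0.92, 0.75],
    [0.35, 0.60, 0.10],
]
y_train = [1, 0, 1, 0]    # 1 = validated as clone, 0 = rejected by the judge

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Predict the validation outcome for a newly reported pair.
print(clf.predict([[0.91, 0.95, 0.80]]))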
In today's open source era, developers look for similar software applications in source code repositories for a number of reasons, including exploring alternative implementations, reusing source code, or looking for a better application. However, while there are a great many studies on finding similar applications written in the same programming language, there is a marked lack of studies on finding similar software applications written in different languages. In this paper, we fill the gap by proposing a novel model, CroLSim, which is able to detect similar software applications across different programming languages. In our approach, we use the API documentation to find relationships among the API calls used by the different programming languages. We adopt a deep learning based word-vector learning method to identify semantic relationships among the API documentation, which we then use to detect cross-language similar software applications. For evaluating CroLSim, we formed a repository consisting of 8,956 Java, 7,658 C#, and 10,232 Python applications collected from GitHub. We observed that CroLSim can successfully detect similar software applications across different programming languages with a mean average precision rate of 0.65 and an average confidence rate of 3.6 (out of 5) with 75% highly rated successful queries, which outperforms all related existing approaches with a significant performance improvement.
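The comparison step can be sketched as follows; the toy three-dimensional documentation vectors, the plain averaging of call vectors, and the API names are illustrative assumptions standing in for the learned word vectors used by CroLSim.

# Sketch: compare applications across languages through vectors derived from the
# documentation of the API calls they use.
import numpy as np

# Hypothetical embeddings learned from API documentation text.
doc_vectors = {
    "java.io.FileReader": np.array([0.90, 0.10, 0.00]),
    "python.open": np.array([0.85, 0.15, 0.05]),
    "java.net.HttpURLConnection": np.array([0.10, 0.90, 0.20]),
    "python.requests.get": np.array([0.15, 0.88, 0.25]),
}

def app_vector(api_calls):
    """Represent an application by the mean vector of its API calls' documentation."""
    return np.mean([doc_vectors[c] for c in api_calls], axis=0)

def similarity(app_a, app_b):
    a, b = app_vector(app_a), app_vector(app_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

java_app = ["java.io.FileReader", "java.net.HttpURLConnection"]
python_app = ["python.open", "python.requests.get"]
print(similarity(java_app, python_app))   # a high score indicates cross-language similar apps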
Scientific workflow management systems have been widely used in recent years for data-intensive analysis tasks and domain-specific discoveries. It often becomes challenging for an individual to effectively analyze large-scale scientific data of relatively high complexity and dimensionality, and such analysis requires a collaboration of multiple members from different disciplines. Hence, researchers have focused on designing collaborative workflow management systems. However, consistency management in the face of conflicting concurrent operations of the collaborators is a major challenge in such systems. In this paper, we propose a locking scheme (e.g., a collaborator gets write access to non-conflicting components of the workflow at a given time) to facilitate consistency management in collaborative scientific workflow management systems. The proposed method allows locking workflow components at a granular level in addition to supporting locks on a targeted part of the collaborative workflow. We conducted several experiments to analyze the performance of the proposed method in comparison to related existing methods. Our studies show that the proposed method can reduce the average waiting time of a collaborator by up to 36.19% in comparison to existing descendant modular-level locking techniques for collaborative scientific workflow management systems.
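A minimal sketch of component-level locking is given below; the lock-manager API, the conflict rule, and the module names are illustrative assumptions, not the granular locking protocol proposed in the paper.

# Sketch: collaborators get write access only to non-conflicting workflow components.
import threading

class WorkflowLockManager:
    def __init__(self):
        self._locks = {}              # component id -> owning collaborator
        self._guard = threading.Lock()

    def acquire(self, component_id, collaborator):
        """Grant write access only if no other collaborator holds this component."""
        with self._guard:
            owner = self._locks.get(component_id)
            if owner is None or owner == collaborator:
                self._locks[component_id] = collaborator
                return True
            return False

    def release(self, component_id, collaborator):
        with self._guard:
            if self._locks.get(component_id) == collaborator:
                del self._locks[component_id]

mgr = WorkflowLockManager()
print(mgr.acquire("align_sequences", "alice"))   # True: Alice edits this module
print(mgr.acquire("plot_results", "bob"))        # True: a non-conflicting component
print(mgr.acquire("align_sequences", "bob"))     # False: Bob must wait for Alice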
Detection, tracking, and refactoring of code clones (i.e., identical or nearly similar code fragments in the code-base of a software system) have been extensively investigated by a great many studies. Code clones have often been considered bad smells. While clone refactoring is important for removing code clones from the code-base, clone tracking is important for consistently updating code clones that are not suitable for refactoring. In this research, we investigate the importance of micro-clones (i.e., code clones of fewer than five lines of code) in consistent updating of the code-base. While existing clone detectors and trackers have ignored micro-clones, our investigation of thousands of commits from six subject systems implies that around 80% of all consistent updates during system evolution occur in micro-clones. The percentage of consistent updates occurring in micro-clones is significantly higher than that in regular clones according to our statistical significance tests. Also, the consistent updates occurring in micro-clones can account for up to 23% of all updates during the whole period of evolution. According to our manual analysis, around 83% of the consistent updates in micro-clones are non-trivial. As micro-clones also require consistent updates like regular clones, tracking or refactoring micro-clones can help us considerably minimize the effort of consistently updating such clones. Thus, micro-clones should also be taken into proper consideration when making clone management decisions.
The design and maintenance of APIs are complex tasks due to the constantly changing requirements of their users. Despite the efforts of their designers, APIs may suffer from a number of issues (such as incomplete or erroneous documentation, poor performance, and backward incompatibility). To maintain a healthy client base, API designers must learn about these issues in order to fix them. Question answering sites, such as Stack Overflow (SO), have become a popular place for discussing API issues. These posts about API issues are invaluable to API designers, not only because they can help designers learn more about the problem but also because they facilitate learning the requirements of API users. However, the unstructured nature of posts and the abundance of non-issue posts make the task of detecting SO posts concerning API issues difficult and challenging. In this paper, we first develop a supervised learning approach using a Conditional Random Field (CRF), a statistical modeling method, to identify API issue-related sentences. We use this information together with different features of posts and the experience of users to build a technique, called CAPS, that can classify SO posts concerning API issues. Evaluation of CAPS using carefully curated SO posts on three popular API types reveals that the technique outperforms all three baseline approaches we consider in this study. We also conduct studies to test the generalizability of CAPS results and to understand the effects of different sources of information on it.
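The sentence-labeling step can be sketched with a linear-chain CRF; the sklearn-crfsuite library, the hand-picked features, and the toy post below are illustrative assumptions and do not reflect the actual CAPS feature set or training data.

# Sketch: tag each sentence of a post as API-issue-related ("issue") or not ("other").
import sklearn_crfsuite

def sentence_features(sentence):
    words = sentence.lower().split()
    return {
        "has_error_word": any(w in {"error", "exception", "crash", "bug"} for w in words),
        "has_digit": any(ch.isdigit() for ch in sentence),
        "length": len(words),
    }

posts = [
    ["I upgraded to version 2.3.", "Now the API throws a NullPointerException.", "Thanks for any help."],
]
labels = [["other", "issue", "other"]]     # one label sequence per post

X = [[sentence_features(s) for s in post] for post in posts]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)

new_post = ["The docs are outdated.", "Calling connect() always raises a timeout error."]
print(crf.predict([[sentence_features(s) for s in new_post]]))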
Copying code and then pasting it with a large number of edits is a common activity in software development, and the pasted code is a kind of complicated Type-3 clone. Due to the large number of edits, we consider such a clone a large-gap clone. Large-gap clones can reflect the extension of code, such as changes and improvements. The existing state-of-the-art clone detectors suffer from several limitations in detecting large-gap clones. In this paper, we propose a tool, CCAligner, which uses code windows that tolerate an edit distance of e during matching to detect large-gap clones. In our approach, a novel e-mismatch index is designed and an asymmetric similarity coefficient is used as the similarity measure. We thoroughly evaluate CCAligner both for large-gap clone detection and for general Type-1, Type-2 and Type-3 clone detection. The results show that CCAligner performs better than other competing tools in large-gap clone detection, and has the best execution time for 10MLOC inputs with good precision and recall in general Type-1 to Type-3 clone detection. Compared with existing state-of-the-art tools, CCAligner is the best performing large-gap clone detection tool, and remains competitive with the best clone detectors in general Type-1, Type-2 and Type-3 clone detection.
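The e-mismatch indexing idea can be sketched as follows: two q-token windows that differ in at most e tokens share at least one (q - e)-token sub-key, so indexing those sub-keys surfaces candidate pairs cheaply. The window size, mismatch tolerance, and toy token streams below are illustrative values, and the sketch omits CCAligner's asymmetric similarity verification.

# Sketch of e-mismatch indexing over token windows.
from itertools import combinations
from collections import defaultdict

Q, E = 6, 1   # window size and tolerated mismatches (illustrative values)

def window_keys(window):
    """All (Q - E)-token subsequences of a window, used as index keys."""
    return {tuple(window[i] for i in idx) for idx in combinations(range(Q), Q - E)}

def index_windows(tokens, fragment_id, index):
    for start in range(len(tokens) - Q + 1):
        for key in window_keys(tokens[start:start + Q]):
            index[key].add(fragment_id)

index = defaultdict(set)
index_windows("if x > 0 : return x".split(), "frag_a", index)
index_windows("if y > 0 : return y".split(), "frag_b", index)  # one token differs in the first window

# Fragments sharing a key are candidate clone pairs for further verification.
print({tuple(sorted(ids)) for ids in index.values() if len(ids) > 1})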